Philology: scientific researches
Reference:

The Chekhov Digital project: semantic markup of a parallel corpus of translations of Chekhov's fiction into German

Severina Elena Mikhailovna

ORCID: 0000-0001-6518-2771

Doctor of Philosophy

Professor, Institute of Philology, Journalism and Intercultural Communication, Southern Federal University

344006, Russia, Rostov-on-Don, Universitetsky Lane, 93

emkovalenko@sfedu.ru
 

 
Fyodorov Nikita Aleksandrovich

ORCID: 0000-0002-7436-2202

Master's degree, Institute of Philology, Journalism and Intercultural Communication, Southern Federal University

344006, Russia, Rostov-on-Don, Universitetsky Lane, 93

nfyodorov@sfedu.ru

DOI:

10.7256/2454-0749.2024.4.70560

EDN:

PXMQSB

Received:

24-04-2024


Published:

06-05-2024


Abstract: The article discusses the principles for developing a semantically marked-up parallel corpus of German translations of Chekhov's fiction within the Chekhov Digital project, a digital academic edition of the writer's collected works in TEI (Text Encoding Initiative) format. The parallel corpus is intended to provide a digital infrastructure for studying the writer's works, allowing researchers to analyze and compare the original texts with their translations. The study identified difficulties in interpreting significant elements of the writer's works, in the specifics of their translation into German, and in the semantic markup of translated fiction, for example in defining the boundaries of markup elements and the relationships between them. Ways to overcome these difficulties are proposed, drawing on digital methods, natural language processing technologies and the TEI digital publication standard. A markup structure based on TEI makes documents machine-readable, which makes it possible to develop tools for complex semantic information retrieval. Including parallel corpora of translations of A. P. Chekhov's works into different languages in the Chekhov Digital project expands the research tools available in translation studies: translations can be compared with the originals to detect similarities and differences in vocabulary, grammar, style and cultural references, and routine research processes can be automated, which makes searching and analyzing large volumes of text much more efficient. The results of the project will contribute to the development of the digital humanities environment and to the preservation and popularization of the literary heritage of A. P. Chekhov.
The creation of a semantically marked-up parallel corpus of translations will be important for literary scholars, linguists and translators, allowing them to study the specifics of translations of Chekhov's works and to develop new forms of text analysis and interpretation. The experience gained during the project will be valuable for future research and practical applications, demonstrating the effectiveness of digital technologies in humanities research and education.


Keywords:

Chekhov Digital project, Digital Edition, Chekhov, Parallel Corpora, Text Encoding Initiative, Machine-readable Markup, Semantic Search, Digital Technologies, Natural Language Processing, Parsing

This article is automatically translated.

Introduction

In the modern humanities, digital technologies are becoming increasingly important. On the one hand, this reflects the global trend toward digitalization in various spheres of human activity; on the other, it enables new forms of research and new approaches in the humanities. The development of the digital environment is changing the forms in which texts exist and is shaping new approaches to publishing literary texts. Such publications should be digital and should ensure not only the preservation of texts as cultural objects but also access to them as carriers of cultural meanings and knowledge, which requires transforming philological knowledge into a machine-readable digital format. Within semantic markup, a text can be represented as linked data that expresses direct, explicit relationships between entities in a form suitable for computer processing. In this context, creating semantic editions of literary works helps to systematize writers' creative heritage and to streamline the professional work of philologists.

The Institute of Philology, Journalism and Intercultural Communication of the Southern Federal University, together with the Department of Humanitarian Studies of the Southern Scientific Center of the Russian Academy of Sciences and the International Laboratory of Language Convergence of the National Research University Higher School of Economics, is working on the Chekhov Digital project, which is developing a semantic digital edition of the Complete Works and Letters of A. P. Chekhov in 30 volumes (PSSiP) [1] in accordance with the Text Encoding Initiative (TEI) digital publication standard [2]. Semantic machine-readable markup in TEI format has been developed for each text of the PSSiP, including editorial and critical materials. Work is also underway on a digital index of the names and titles mentioned in the texts of the academic edition, including its commentaries and notes. The digital index is based on the existing indexes, which are verified semi-automatically and supplemented with information about real people and objects from external databases such as Wikidata [3]. The main goals of the project are to make the semantic markup of the texts publicly available and to develop accessible digital tools for working with the PSSiP texts [4].

The Chekhov Digital project uses a text markup structure based on the TEI digital publishing standard [2], which makes documents machine-readable. Documents marked up in accordance with the TEI principles consist of two parts: a TEI header with metadata about the source (for example, publication description, title, author's name, text language, etc.) and a text part that contains the marked-up text itself. TEI makes it possible to take the specifics of a text into account: for example, to record additional metadata for letters (addressee, date and place of writing, etc.), to reflect how information is presented in the text, and to mark up named entities, biographical information and some social categories (social status, professional affiliation, etc.). Texts marked up in this way make it possible to develop tools for complex semantic information retrieval [4, p. 158].
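The two-part structure described above can be illustrated with a minimal sketch. It uses Python's standard `xml.etree.ElementTree` module; the function name `build_tei`, the placeholder metadata values and the selection of header fields are assumptions made for illustration and do not reproduce the project's actual templates.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"           # TEI P5 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:lang


def _q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title: str, author: str, lang: str, paragraphs: list[str]) -> ET.Element:
    """Build a minimal TEI tree: a header with metadata plus a text body."""
    ET.register_namespace("", TEI_NS)
    root = ET.Element(_q("TEI"))

    # Part 1: the TEI header with metadata about the source.
    header = ET.SubElement(root, _q("teiHeader"))
    file_desc = ET.SubElement(header, _q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _q("titleStmt"))
    ET.SubElement(title_stmt, _q("title")).text = title
    ET.SubElement(title_stmt, _q("author")).text = author
    publication = ET.SubElement(file_desc, _q("publicationStmt"))
    ET.SubElement(publication, _q("p")).text = "Placeholder publication statement"
    source = ET.SubElement(file_desc, _q("sourceDesc"))
    ET.SubElement(source, _q("p")).text = "Placeholder source description"

    # Part 2: the text itself, with its language recorded in xml:lang.
    text = ET.SubElement(root, _q("text"))
    text.set(f"{{{XML_NS}}}lang", lang)
    body = ET.SubElement(text, _q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, _q("p")).text = paragraph
    return root


if __name__ == "__main__":
    # Hypothetical values, used only to show the resulting structure.
    doc = build_tei("Example title", "A. P. Chekhov", "ru", ["First paragraph."])
    print(ET.tostring(doc, encoding="unicode"))
```

Letter-specific metadata or named-entity markup would be added to the header and body in the same way; they are left out here to keep the sketch short.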

In this study, we developed principles for the semantic markup of German translations of Chekhov's fiction within the Chekhov Digital project, which will make it possible to create a parallel corpus of translations in TEI format.

Including parallel corpora of translations of A. P. Chekhov's works into different languages in the Chekhov Digital project expands the research tools available in translation studies: translations can be compared with the originals to detect similarities and differences in vocabulary, grammar, style and cultural references, and routine research processes can be automated, which makes searching and analyzing large volumes of text much more efficient. Thus, including a parallel corpus of A. P. Chekhov's short stories and their German translations in the Chekhov Digital project is an important and promising task that contributes to the development of linguistic, literary and translation studies.

Creating parallel corpora is a difficult task that requires solving a number of problems related to text alignment. Alignment is the process of matching text fragments in the source and target languages. Alignment can be performed either manually or with the help of automatic tools, but in any case this process requires significant time and resource costs.

As a rule, texts in parallel corpora are correlated sentence by sentence; however, this method is not always suitable for automatic markup: "One of the most significant difficulties of alignment is that the author's division of the text into sentences and paragraphs is not always maintained in the translation text" [5, p. 289]. Texts in different languages may differ in structure and organization depending on a variety of linguistic factors. For example, sentences in one language may be longer or shorter than in another, or may contain different grammatical constructions. German has a "frame construction" of the sentence [6, p. 141]: the verb occupies the final position, and the other elements of the sentence are arranged around it. This construction is characteristic of German and differs from the sentence structure of many other languages, where the verb is usually located in the middle of the sentence. This can make alignment more difficult and less accurate. In addition, alignment may be complicated by idiomatic expressions, phraseological units and other linguistic features that cannot be translated word for word or matched directly with the text in the other language.

Thus, text alignment is a necessary step that makes it possible to compare and analyze texts with sufficient accuracy. The German translations of Chekhov's short stories that we selected for analysis and markup render the original texts fairly accurately, but they correlate better by paragraphs than by sentences because of the features of German syntax mentioned above.
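Paragraph-level correlation of this kind can be sketched as follows. This is an illustrative simplification, not the project's alignment procedure: it assumes that paragraphs are separated by blank lines and that both texts contain the same number of paragraphs, and it flags pairs whose lengths differ suspiciously. The function names and the `max_ratio` threshold are hypothetical.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split plain text into non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def align_paragraphs(
    source: str, target: str, max_ratio: float = 2.0
) -> list[tuple[str, str, bool]]:
    """Pair paragraphs by position and flag implausible length ratios.

    Returns (source_paragraph, target_paragraph, plausible) triples, where
    `plausible` is False if one paragraph is more than `max_ratio` times
    longer (in characters) than the other.
    """
    src, tgt = split_paragraphs(source), split_paragraphs(target)
    if len(src) != len(tgt):
        # Positional pairing is only meaningful when the counts match.
        raise ValueError(f"paragraph counts differ: {len(src)} vs {len(tgt)}")
    aligned = []
    for s, t in zip(src, tgt):
        ratio = max(len(s), len(t)) / min(len(s), len(t))
        aligned.append((s, t, ratio <= max_ratio))
    return aligned
```

Pairs flagged as implausible, or texts whose paragraph counts differ, would still need manual review, which is consistent with the time and resource costs noted above.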

When creating parallel text corpora with semantic markup, researchers strive to automate and standardize the structure of this markup. The creators of the Chekhov Digital semantic edition are guided by the significant vocabulary identified through analysis [7]. We believe that optimizing the markup of a parallel corpus requires identifying significant vocabulary in the texts of both languages, which will help to resolve some of the alignment difficulties.

Research material

A total of 84 German translations of Chekhov's short stories were collected from the Projekt Gutenberg-DE electronic library of German-language texts [8]. The collected corpus comprises all the translations that are freely available in this electronic library. The original texts of the writer's works from the Complete Academic Collection of Works are presented within the Chekhov Digital project [9].

At the moment, the total corpus of the texts under study is 168 texts: 84 texts in Russian (original texts), 84 texts of translations made by A. Eliasberg (46), V. Chumikov (12), K. Holm (5); the authorship of 21 texts of translations is not indicated either on the Projekt Gutenberg-DE website or in other collections of translations of Chekhov's short stories that are freely available on the Internet [10].

Stylometric analysis [11] of these unattributed texts revealed features of the individual styles of several different translators, which suggests that they were translated jointly [12].

The marked-up texts of the Chekhov Digital project are freely distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license, i.e. the work and its markup may be used freely provided that authorship is indicated and the terms of use are preserved [9]. The materials and markup of the Projekt Gutenberg-DE website are distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license, i.e. the work and its markup may be used freely, subject to attribution, but only for non-commercial purposes.

Most of the stories studied belong to the early period of A. P. Chekhov's work, from 1883 to 1887: volume 2, 14 stories; volume 4, 12 stories; volume 5, 18 stories; volume 6, 19 stories.

Results and discussion

Digital technologies play an important role in preparing the corpus, since they make it possible to automate the collection and systematization of text data. For this purpose we used parsing (web scraping), that is, the automated collection and systematization of information from open sources by means of software [13, p. 33]. The Beautiful Soup library for Python was used, which also made it possible to check whether the text markup conforms to the templates developed in the project [7] and to make the necessary adjustments.

The annotation of texts, that is, their markup according to grammatical, syntactic and semantic features, which makes a text machine-readable for automatic interpretation, is equally important for creating a parallel corpus within the Chekhov Digital project. To identify the elements significant for this kind of markup, digital analysis methods were used: topic modeling to determine the set of hidden topics in the stories under study and their key vocabulary; sentiment analysis to identify the emotional context; and stylometric analysis to study similarities and differences between the styles of the author and the translators of his prose, to investigate co-authorship in the translations of the short stories, etc.
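As an illustration of the stylometric step, the sketch below computes Burrows's Delta, a widely used stylometric distance based on the z-scores of the most frequent words. The text above does not state which measure was actually applied, so the choice of Delta, the whitespace tokenization and the function names are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text: str, vocab: list[str]) -> list[float]:
    """Relative frequency of each vocabulary word in a text (crude tokenizer)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocab]


def burrows_delta(texts: dict[str, str], n_words: int = 30) -> dict[tuple[str, str], float]:
    """Pairwise Burrows's Delta over the n most frequent words of the corpus."""
    corpus_counts = Counter()
    for text in texts.values():
        corpus_counts.update(text.lower().split())
    vocab = [word for word, _ in corpus_counts.most_common(n_words)]

    freqs = {name: relative_freqs(text, vocab) for name, text in texts.items()}

    # Standardize each word's frequency across all texts (z-scores).
    z = {name: [] for name in texts}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in texts]
        mu, sigma = mean(column), pstdev(column)
        for name in texts:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)

    # Delta is the mean absolute difference between two texts' z-scores.
    names = list(texts)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Smaller values indicate more similar word-frequency profiles; a real study would use a proper tokenizer and much longer texts than a toy example allows.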

Thus, semantic markup and the creation of a parallel corpus are tasks that, in our opinion, require a digital analysis of key vocabulary, emotive elements, syntactic features of texts, markers of author's style, etc. When creating a parallel corpus of translations of Chekhov's short stories into German for the Chekhov Digital project, we rely on the results obtained and the experience of using digital methods to develop functional tools for working with the creative heritage of the writer.

The TEI (Text Encoding Initiative) standard is used as the markup language for the works on the Chekhov Digital website. It is an international standard for representing texts in digital form, used in the humanities, linguistics and publishing. Its main advantages are flexibility and extensibility, interdisciplinarity, support for multilingualism, and compatibility with other standards [2]. TEI thus offers a powerful tool for representing digital texts across a wide range of disciplines and applications, and its flexibility and extensibility make it possible to adapt it to specific tasks, including those of the Chekhov Digital project, by creating new tags or modifying existing ones.

Since TEI is compatible with many other markup languages, tags can be converted from HTML markup to the TEI format. To do this, one has to study the source code, determine which tags correspond to each other and record these correspondences. The structural markup of the corpus of original (Russian-language) works by A. P. Chekhov for the Chekhov Digital project, including letters, short stories, plays, essays, etc., was carried out in this way. The source of the HTML-marked texts was the electronic scientific edition (ENI) "Chekhov", hosted on the website of the Fundamental Electronic Library [14]. Information about the texts was collected for the TEI documents from the HTML markup of its pages: titles of works, years of writing, bibliographic descriptions, text length (number of pages), volume number, presence of headings, formatting, footnotes, etc. However, markup differs from site to site, which causes difficulties when the code does not contain the necessary information about the works. One such site is Projekt Gutenberg-DE [8], which hosts the texts of the German translations of Chekhov's short stories. Its markup makes it possible to correlate original and translated texts by paragraphs and sentences, but it lacks information such as the bibliographic description, the year of publication and translation, and the text length (number of pages), which complicates automating the collection of metadata about the translation source and its inclusion in a TEI document. Therefore, information about the translated texts was collected manually and used as a reference for filling in the metadata of the TEI documents.

Nevertheless, some correspondences were found between the HTML markup of the Projekt Gutenberg-DE website and the TEI markup of the Chekhov Digital project. For example, the HTML tag <meta name="title" content="..."/> corresponds to the TEI tag <title>, and the HTML tag <meta name="author" content="..."/> corresponds to the TEI tag <author>, so some metadata could be collected from the pages of the Projekt Gutenberg-DE website. However, works in German are presented there in collections, and the metadata in the HTML markup describes the entire collection. Therefore, when the title of a work (<title> tag) is marked up automatically, it has to be taken from the <h3> tag, which marks a third-level heading (the title of the work within the collection). Similarly, the tags <meta name="editor"> and <meta name="translator"> list all the translators who took part in the collection, whereas the translator of an individual work is given in the <h5> tag (fifth-level heading).
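The tag correspondences described above can be sketched in code. To keep the example self-contained, it uses Python's standard `html.parser` module as a stand-in for Beautiful Soup, which the project actually used. The class and function names, the keys of the returned dictionary and the fallback rules are assumptions made for illustration.

```python
from html.parser import HTMLParser


class CollectionPageParser(HTMLParser):
    """Collect <meta> values and the text of <h3>/<h5> headings from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.headings = {"h3": [], "h5": []}
        self._open = None  # heading tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in {"title", "author", "translator"}:
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag in self.headings:
            self._open = tag
            self.headings[tag].append("")

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.headings[self._open][-1] += data


def tei_fields(html: str) -> dict:
    """Map page markup to values for the TEI <title> and <author> elements.

    The <meta> tags describe the whole collection, so the work title is taken
    from the first <h3> and the translator from the first <h5>, falling back
    to the collection-level metadata when a heading is missing.
    """
    parser = CollectionPageParser()
    parser.feed(html)
    h3 = [h.strip() for h in parser.headings["h3"] if h.strip()]
    h5 = [h.strip() for h in parser.headings["h5"] if h.strip()]
    return {
        "collection_title": parser.meta.get("title"),
        "author": parser.meta.get("author"),
        "title": h3[0] if h3 else parser.meta.get("title"),
        "translator": h5[0] if h5 else parser.meta.get("translator"),
    }
```

Fields that the page markup does not provide, such as the bibliographic description or the year of translation, would still have to be supplied from the manually collected reference data.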

To build a parallel corpus of translations of Chekhov's works into different languages, the Chekhov Digital project uses the <choice> tag, which TEI provides for representing alternative encodings of the same text fragment. The tag makes it possible to encode ambiguity or uncertainty in the text and to offer the reader or researcher several possible interpretations. Here, <choice> is used not only to mark alternative encodings of a single fragment but also to group entire text segments that exist in different versions, so its child elements may contain large chunks of text. In our case, the fragments are paragraphs or sentences, depending on how the marked-up texts are correlated; either paragraph-level or sentence-level correlation may be optimal in a given case.

There are different ways to mark two alternative versions of a whole text. In the first, each version is enclosed in a paragraph tag <p></p>, and each <p> element carries a unique identifier in the @xml:id attribute that indicates the text variant: the original or a translation into a specific language (de, en, bg, etc.):

<choice>

    <p xml:id="orig">A young ginger dog — a cross between a dachshund and a mongrel — very similar in face to a fox, ran back and forth on the sidewalk and looked around anxiously. Occasionally she stopped and, crying, lifting one cold paw, then the other, tried to give herself an account: How could it be that she got lost?</p>

    <p xml:id="de">Ein junger rotbrauner Hund – eine Kreuzung von Dachs und Dorfköter –, dessen Schnauze der eines Fuchses sehr ähnelte, lief auf dem Trottoir hin und her und schaute sich unruhig nach allen Seiten um. Zuweilen blieb er stehen, hob winselnd bald die eine, bald die andere seiner frierenden Pfoten und suchte sich darüber Rechenschaft zu geben, wie es doch passieren konnte, daß er sich verirrt hatte?</p>

</choice>

In the second way, each version is enclosed in a <seg> element (from "segmentation"), which contains the text of the version. The <seg> element has the attributes @type and @subtype, which specify the type of fragment, original ("orig") or translated ("translated"), and the translation language ("en", "de", "fr"), respectively:

<choice>

     <seg type="orig">A young ginger dog — a cross between a dachshund and a mongrel — very similar in face to a fox, ran back and forth on the sidewalk and looked around anxiously. Occasionally she stopped and, crying, lifting one cold paw, then the other, tried to give herself an account: How could it be that she got lost?</seg>

     <seg type="translated" subtype="de">Ein junger rotbrauner Hund – eine Kreuzung von Dachs und Dorfköter –, dessen Schnauze der eines Fuchses sehr ähnelte, lief auf dem Trottoir hin und her und schaute sich unruhig nach allen Seiten um. Zuweilen blieb er stehen, hob winselnd bald die eine, bald die andere seiner frierenden Pfoten und suchte sich darüber Rechenschaft zu geben, wie es doch passieren konnte, daß er sich verirrt hatte?</seg>

</choice>

The <seg> element was chosen as the preferred one for marking up the parallel corpus for a number of reasons. Firstly, its structure allows us to encode alternative versions of the text, correlated by paragraphs and significant vocabulary, more accurately, since its purpose is to mark entire segments or fragments of text. Unlike the xml:id attribute, which creates links to various document elements and can be used, for example, to link named entities or other individual text elements, the <seg> tag has a narrower purpose. Choosing it as the main element for marking up alternative text segments avoids the problem of a single mechanism performing several unrelated functions.

Secondly, based on the need for alignment according to the key vocabulary highlighted in the process of digital research, we first strive to divide texts into small segments, within which it will be easier to correlate significant elements. In this case, we can use the <seg> tag to highlight individual paragraphs in each text, indicate the target language, etc. Then, inside these fragments, we can align meaningful vocabulary using the xml:id attribute or others with a specific purpose (names, dates, phraseological units, etc.).
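The <seg>-based encoding can be generated and queried programmatically. The following sketch uses Python's standard `xml.etree.ElementTree` module and follows the attribute values shown in the example above ("orig", "translated" and a language code in @subtype); the function names `make_choice` and `translation_for` are hypothetical, and namespaces are omitted for brevity, as in the examples.

```python
import xml.etree.ElementTree as ET


def make_choice(original: str, translations: dict[str, str]) -> ET.Element:
    """Wrap an original segment and its translations in one <choice> element."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "seg", {"type": "orig"}).text = original
    for lang, text in translations.items():
        seg = ET.SubElement(choice, "seg", {"type": "translated", "subtype": lang})
        seg.text = text
    return choice


def translation_for(choice: ET.Element, lang: str):
    """Return the text of the translated segment for a language, or None."""
    for seg in choice.findall("seg"):
        if seg.get("type") == "translated" and seg.get("subtype") == lang:
            return seg.text
    return None


if __name__ == "__main__":
    # Hypothetical aligned paragraph pair, used only to show the output shape.
    element = make_choice("Original paragraph.", {"de": "Deutscher Absatz."})
    print(ET.tostring(element, encoding="unicode"))
```

Because the segment type and language are held in attributes, a search tool can select all segments of one language without relying on identifiers, which leaves xml:id free for linking significant vocabulary inside the segments.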

Conclusion

The Chekhov Digital semantic edition is a rapidly developing project to digitize the literary heritage of A. P. Chekhov, focused on creating new research tools for studying and analyzing his works. One of these tools is a parallel corpus of German translations of the writer's fiction, which will make it possible to detect similarities and differences in vocabulary, grammar, style and cultural references between the translations and the originals. Developing semantic markup for a parallel corpus of German translations of Chekhov's short stories is an important step in building this tool. The experience of using digital methods to develop functional tools for working with Chekhov's creative legacy within the Chekhov Digital project is valuable for future research and practical applications. It demonstrates that digital technologies can be used effectively in humanities research and education.

In general, the Chekhov Digital project is a promising direction for the development of digital humanities education and research, which contributes to the preservation and popularization of the creative heritage of A. P. Chekhov. The creation of a parallel corpus of translations of the writer's short stories into German expands the functionality of the project and its applicability in various fields of knowledge, including linguistics, literary studies and translation studies.

References
1. Chekhov, A. P. (1974-1983). Polnoe sobranie sochinenii i pisem [Complete works and letters] (Vols. 1-30). Moscow: Nauka.
2. TEI Consortium (Eds.). (2023, November 16). TEI P5: Guidelines for Electronic Text Encoding and Interchange (Version 4.7.0). Retrieved from http://www.tei-c.org/Guidelines/P5/
3. Wikidata. (n.d.). Main Page. Retrieved from https://www.wikidata.org/wiki/Wikidata:Main_Page
4. Severina, E. M., Bonch-Osmolovskaya, A. A., & Kudin, A. M. (2022). Digital philological practices: The "Chekhov Digital" project. Aktual'nye problemy filologii i pedagogicheskoi lingvistiki, 2, 153-165.
5. Dobrovolsky, D. O., Kretov, A. A., & Sharov, S. A. (2005). Corpus of parallel texts: Architecture and possibilities of use. In Nacional'nyi korpus russkogo yazyka: 2003-2005 (pp. 263-296). Moscow: Indrik.
6. Kalinina, E. E. (2017). Frame construction of a sentence as a result of the genesis of word order in Indo-European languages. Mezhdunarodnyi nauchno-issledovatel'skii zhurnal, 5-1(59), 141-143.
7. Severina, E. M., & Larionova, M. C. (2020). New philological practices: Semantic edition of A. P. Chekhov's texts. Philology: scientific researches, 10, 13-21. doi:10.7256/2454-0749.2020.10.33970 Retrieved from http://en.e-notabene.ru/fmag/article_33970.html
8. Projekt Gutenberg-DE. (n.d.). Retrieved from https://www.projekt-gutenberg.org/
9. Chekhov Digital. (n.d.). Semantic edition of A. P. Chekhov's texts. Retrieved from https://chekhov-digital.sfedu.ru/
10. Tschekhow, A. P. (2023). Gesammelte Erzählungen: Geschichten in Grau + Kleine Erzählungen + Lustige Geschichten + Von Frauen und Kindern + Duell…. Sharp Ink Publishing.
11. Eder, M. (2016). Rolling stylometry. Digital Scholarship in the Humanities, 31(3), 457-469.
12. Fedorov, N. A. (2024). Project Chekhov Digital: Stylometric analysis of texts translated into German from A. P. Chekhov's stories. In Innovations in science and transformation of the paradigm of modern education (pp. 147-150). Kazan: OOO "SANTREM".
13. Menshikov, Ya. S. (2022). Advantages of automatic data collection on the internet over manual data collection. Universum: Technical Sciences, 10-1(103), 33-36.
14. ENI «Chekhov». FEB. (n.d.). Retrieved from https://feb-web.ru/feb/chekhov/default.as

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

Digitalization is rapidly spreading to all spheres of activity, and the humanities are no exception. As the author notes at the beginning of the work, "in modern humanitarian knowledge, digital technologies are becoming increasingly important, corresponding to global trends in digitalization of various spheres of human activity, on the one hand, and on the other allows the development of new forms and research approaches in the humanities. The development of the digital environment is changing the forms of existence of texts, forming new approaches to the formats of publications of literary texts that should become digital, ensuring not only the preservation of texts as cultural objects, but also access to them as carriers of cultural meanings and knowledge, which requires the transformation of philological knowledge into a digital machine-readable format." It is difficult to disagree with this statement, which is objective and constructive. The article analyzes the Chekhov Digital project, which is run by the Institute of Philology, Journalism and Intercultural Communication of the Southern Federal University together with the Department of Humanitarian Studies of the Southern Scientific Center of the Russian Academy of Sciences and the International Laboratory of Language Convergence of the National Research University Higher School of Economics. I believe that a competent assessment of the project is necessary for the further improvement of this digital resource. The article is competently structured: the alternation of theoretical and practical material allows even an unprepared reader to form the necessary picture of the project. The research methodology is consistent with a number of modern developments. The judgments are scientific in nature, and no serious violations have been identified.
The style is consistent with academic writing, for example: "the Chekhov Digital project uses a text markup structure based on the TEI (Text Encoding Initiative) digital publishing standard, which makes documents machine-readable. Documents marked up in accordance with the TEI principles consist of two parts: a TEI header with source metadata (for example, publication description, title, author's name, text language, etc.), and a text module that includes marked up text information. TEI allows you to take into account the specifics of the text, for example, additional metadata for letters (addressee, date and place of writing, etc.), features of the presentation of information in the text, mark up named entities, biographical information and some social categories (social status, professional affiliation, etc.)," or "creating parallel corpora is a difficult task, which requires solving a number of problems related to text alignment. Alignment is the process of matching text fragments in the source and target languages. Alignment can be performed both manually and with the help of automatic tools, but in any case this process requires significant time and resource costs," etc. The aim of the study is stated as follows: "as part of the study, the principles of semantic markup of translations of Chekhov's fiction into German have been developed within the framework of the Chekhov Digital project, which will allow creating a parallel translation corpus in TEI format." Statistical data are presented transparently: "84 texts of translations of Chekhov's short stories into German were collected on the pages of the Projekt Gutenberg-DE electronic library of German-language texts. The collected corpus represents all the translations that are freely available in this electronic library.
The original texts of the writer's works from the Complete Academic Collection of works are presented within the framework of the Chekhov Digital project. At the moment, the total corpus of the texts under study is 168 texts: 84 texts in Russian (original texts), 84 texts of translations made by A. Eliasberg (46), V. Chumikov (12), K. Holm (5); the authorship of 21 texts of translations is not indicated either on the Projekt Gutenberg-DE website or in other collections of translations of A.P. Chekhov's short stories that are freely available on the Internet ...". I believe that the work has a practical focus, and its results can be used further in creating digital platforms. The conclusions follow from the main body of the work; the author notes that "in general, the Chekhov Digital project is a promising direction for the development of digital humanities education and research, which contributes to the preservation and popularization of the creative heritage of A. P. Chekhov. The creation of a parallel corpus of translations of the writer's short stories into German expands the functionality of the project and its applicability in various fields of knowledge, including linguistics, literary studies and translation studies." It seems that work in these areas can be continued. I conclude that the topic of the study has been covered, the goal has been achieved, the general requirements of the journal have been met, and the text does not need serious editing or revision. I recommend the peer-reviewed article "The Chekhov Digital Project: Semantic markup of the parallel corpus of translations of Chekhov's fiction into German" for publication in the journal Philology: Scientific Research.