Shared Task: Discourse-Level Literary Translation



Release of Train, Valid and Test Data 📚 May 6th, 2023
Release of Official Test Data 🚀 July 14th, 2023
Result submission deadline 🏆 July 27th, 2023
System description abstract 📝 July 27th, 2023
System paper submission deadline ❤️ TBC - September, 2023

All deadlines are Anywhere on Earth. Please note that the submission process for system papers follows the paper submission policy outlined by WMT. For further details, please refer to the "Paper Submission Information" section on the WMT homepage.


Machine translation (MT) faces significant challenges when dealing with literary texts due to their complex nature, as shown in Figure 1. In general, literary MT is bottlenecked by several factors:

Figure 1: Illustration of discourse-level literary translation, which is sampled from our GuoFeng Webnovel Corpus. The colored words demonstrate rich linguistic phenomena.

The main goals of the task are to:


The shared task will be the translation of web fiction texts from Chinese to English.

Participants will be provided with two types of training dataset:

Secondly, we provide two types of validation/testing datasets. Finally, we provide two types of in-domain pretrained models, alongside the general-domain pretrained models listed in the General MT Track. In the final testing stage, participants use their systems to translate an official testing set (a mix of Simple and Difficult unseen test sets, each with two references). Translation quality is measured by both manual evaluation and automatic evaluation metrics. All systems will be ranked by human judgement, carried out by professional translators according to our professional guidelines.

The task has a Constrained Track and an Unconstrained Track with different constraints on the training of the models. Participants can submit either constrained or unconstrained systems with the corresponding flag, and we will distinguish their submissions. For example, if you use ChatGPT or a fine-tuned LLaMA, the submission belongs to the Unconstrained Track.


Copyright and Licence

Copyright is a crucial consideration when it comes to releasing literary texts, and we (Tencent AI Lab and China Literature Ltd.) are the rightful copyright owners of the web fictions included in this dataset. We are pleased to make this data available to the research community, subject to certain terms and conditions.

πŸ“ If you use our datasets, please cite the following papers and claim the original download link (

Data Description (GuoFeng Webnovel Corpus)

💌 The web novels are originally written in Chinese by novel writers and then translated into English by professional translators. As shown in Figure 2, we processed the data using automatic and manual methods: (1) align Chinese books with their English versions using title information; (2) within each book, align Chinese-English chapters according to chapter ID numbers; (3) build an MT-based sentence aligner to generate parallel sentences; (4) ask human annotators to check and revise alignment errors.

💡 Note that (1) some sentences may have no aligned translations, because human translators translate novels at the document level; (2) we keep all document-level information, such as continuous chapters and sentences.

Figure 2: Illustration of our data processing method.
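The chapter-alignment step (2) above can be sketched as follows. This is a minimal illustration, not the released pipeline: `zh_chapters` and `en_chapters` are hypothetical dicts mapping chapter IDs to chapter text, and the real pipeline additionally runs an MT-based sentence aligner (step 3) inside each aligned chapter pair.

```python
def align_chapters(zh_chapters, en_chapters):
    """Pair Chinese and English chapters that share the same chapter ID.

    zh_chapters / en_chapters: dict mapping a chapter ID (e.g. "jdxx_0001")
    to the chapter text. Chapters without a counterpart are dropped,
    mirroring the note that some content has no aligned translation.
    """
    shared_ids = sorted(set(zh_chapters) & set(en_chapters))
    return [(cid, zh_chapters[cid], en_chapters[cid]) for cid in shared_ids]

# Toy example: the second Chinese chapter has no English counterpart yet.
zh = {"jdxx_0001": "第一章 ...", "jdxx_0002": "第二章 ..."}
en = {"jdxx_0001": "Chapter 1 ..."}
pairs = align_chapters(zh, en)
```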

Download: We release 22,567 continuous chapters from 179 web novels, covering 14 genres such as fantasy, science fiction, and romance. The data statistics are listed in Table 1.

| Dataset | # Book | # Chapter | # Sentence | Notes |
|---|---|---|---|---|
| Train | 179 | 22,567 | 1,939,187 | covering 14 genres |
| Valid 1 | 22 | 22 | 755 | same books with Train |
| Test 1 | 26 | 22 | 697 | same books with Train |
| Valid 2 | 10 | 10 | 853 | different books with Train |
| Test 2 | 12 | 12 | 917 | different books with Train |
| Testing Input | 12 | 239 | 16,742 | different books with Train, super-long documents |

🎈 Testing Input 🎈


Data Format: Taking "train.en" as an example, the data format is as follows: <BOOK id=""> </BOOK> indicates a book boundary, which contains a number of continuous chapters marked with the tag <CHAPTER id=""> </CHAPTER>. The contents are split into sentences and manually aligned to the Chinese sentences in "train.zh".

<BOOK id="100-jdxx">
<CHAPTER id="jdxx_0001">
Chapter 1 Make Your Choice, Youth
"Implode reality, pulverize thy spirit. By banishing this world, comply with the blood pact, I will summon forth thee, O' young Demon King!"
At a park during sunset, a childlike, handsome youth placed his left hand on his chest, while his right hand was stretched out with his fingers wide open, as though he was about to release something amazing from his palm. He looked serious and solemn.
... ...
<CHAPTER id="jdxx_0002">
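A minimal reader for this format can be sketched as follows. This is an assumption-laden sketch: the tags look XML-like, but the body lines are plain text, so a line-based scan is used rather than a strict XML parser.

```python
import re

def parse_books(text):
    """Parse <BOOK>/<CHAPTER> structure into {book_id: {chapter_id: [lines]}}."""
    books = {}
    book_id = chapter_id = None
    for line in text.splitlines():
        m = re.match(r'<BOOK id="([^"]+)">', line)
        if m:
            book_id = m.group(1)
            books[book_id] = {}
            continue
        m = re.match(r'<CHAPTER id="([^"]+)">', line)
        if m:
            chapter_id = m.group(1)
            books[book_id][chapter_id] = []
            continue
        if line.startswith("</") or not line.strip():
            continue  # skip closing tags and blank lines
        if book_id and chapter_id:
            books[book_id][chapter_id].append(line)
    return books

# Parse a fragment of the example shown above.
sample = '''<BOOK id="100-jdxx">
<CHAPTER id="jdxx_0001">
Chapter 1 Make Your Choice, Youth
</CHAPTER>
</BOOK>'''
parsed = parse_books(sample)
```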

Pretrained Models

We provide two types of in-domain pretrained models:

| Model | Version | Layer | Hidden Size | Vocabulary Size | Continuous Train |
|---|---|---|---|---|---|
| RoBERTa | base | 12 enc | 768 | 21,128 | Chinese literary texts (84B tokens) |
| mBART | CC25 | 12 enc + 12 dec | 1,024 | 250,000 | English and Chinese literary texts (114B tokens) |


Evaluation Methods

🤖 Automatic Evaluation: To evaluate the submitted systems, we will report multiple evaluation metrics, including d-BLEU (document-level sacreBLEU) and d-COMET (document-level COMET), to measure the overall accuracy and fluency of the translations.
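Document-level metrics such as d-BLEU score each document as a whole: the sentence-level outputs are concatenated into one string per document before the standard sacreBLEU tool is run on them. A minimal sketch of that concatenation step (function and variable names are illustrative, not part of the official scoring scripts):

```python
def to_documents(sentences, doc_ids):
    """Concatenate sentence-level outputs into one string per document,
    preserving the order in which documents first appear."""
    docs, order = {}, []
    for sent, did in zip(sentences, doc_ids):
        if did not in docs:
            docs[did] = []
            order.append(did)
        docs[did].append(sent.strip())
    return [" ".join(docs[did]) for did in order]

hyp_docs = to_documents(
    ["He looked serious.", "He was solemn.", "A new chapter begins."],
    ["jdxx_0001", "jdxx_0001", "jdxx_0002"],
)
# hyp_docs, together with references concatenated the same way, would then
# be passed to sacreBLEU to obtain the d-BLEU score.
```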

πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ Human Evaluation: Besides, we provide professional translators to assess the translations based on more subjective criteria, such as the preservation of literary style and the overall coherence and cohesiveness of the translated texts. Based on our experience with this project, we designed a fine-grained error typology and marking criteria for literary MT.

Result submissions

Submission Email Format:

Subject: WMT2023 Literary Translation Submission (Team Name)
Basic Information: your team name, affiliations, team member names.
System Flag: constrained or unconstrained.
System Description: main techniques and toolkits used in your three submission systems.
Attachment: File names associated with testing input IDs (primary-1.out, primary-2.out, ..., contrastive1-1.out, ..., contrastive2-12.out)
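The attachment list above can be generated programmatically. This is a sketch under the assumption, suggested by the example names `primary-1.out` through `contrastive2-12.out`, that each of the 12 testing-input documents gets one primary and two contrastive output files:

```python
def submission_filenames(n_docs=12, n_contrastive=2):
    """Build the expected attachment names: one primary output and
    n_contrastive contrastive outputs per testing-input document."""
    names = [f"primary-{i}.out" for i in range(1, n_docs + 1)]
    for k in range(1, n_contrastive + 1):
        names += [f"contrastive{k}-{i}.out" for i in range(1, n_docs + 1)]
    return names

files = submission_filenames()
```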


Organization Team

Evaluation Team

Advisory Committee


If you have any further questions or suggestions, please do not hesitate to post to the guofeng-ai Google Group or send an email to Longyue Wang.