EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 12-13, 2024
Miami, Florida, USA
 

ANNOUNCEMENTS

  • 20th May 2024 - The Chinese-Russian dataset is released (new this year). 🔥

  • 14th May 2024 - The Chinese-German dataset is released (new this year). 🔥

  • 12th May 2024 - The Chinese-English dataset is released (same as last year). 📚

  • 10th May 2024 - The shared task is announced. 🌍

  • New this year: two more language pairs and an A/B testing evaluation method. ⚠️

DEADLINES

All deadlines are in AoE (Anywhere on Earth). Please note that the submission process for system papers follows the paper submission policy outlined by WMT. For further details, please refer to the IMPORTANT DATES on the WMT homepage.

Release of training and validation data

May, 2024

Test data released

27th June, 2024

Translation submission deadline

4th July, 2024

System description abstract paper

11th July, 2024

Paper submission deadline

TBA

OVERVIEW

Machine translation (MT) faces significant challenges when dealing with literary texts due to their complex nature. In general, literary MT is bottlenecked by several factors:

  • 😢 Limited Training Data: Most existing document-level datasets are comprised of news articles and technical documents, with limited availability of high-quality, discourse-level parallel data in the literary domain. This scarcity of data makes it difficult to develop systems that can handle the complexities of literary translation.

  • 😱 Rich Linguistic Phenomena: Literary texts contain more complex linguistic knowledge than non-literary ones, especially with regard to discourse. To generate a cohesive and coherent output, MT models require an understanding of the intended meaning and structure of the text at the discourse level.

  • 😅 Long-Range Context: Literary works, such as novels, have much longer contexts than texts in other domains, such as news articles. Translation models must learn to model long-range context in order to maintain translation consistency and make appropriate lexical choices.

  • 😔 Unreliable Evaluation Methods: Evaluating literary translations requires measuring the meaning and structure of the text, as well as the nuances and complexities of the source language. A single automatic evaluation against a single reference is often unreliable. Thus, assessment by professional translators, using well-defined scoring standards and targeted evaluation methods, is considered a necessary complement.

GOALS

The main goals of the task are to:

  • 😊 Encourage research in machine translation, large language models and language agents for document modelling, discourse knowledge integration and literary translation.

  • 🤗 Provide a platform for researchers to evaluate and compare the performance of different methods and systems on this challenging dataset.

  • 😃 Advance the state of the art in machine translation for practical application scenarios.

LANGUAGE PAIRS

  • Chinese-English (document-level with cross-sentence alignment information)

  • 🆕 Chinese-German (document-level without alignment information, may contain some translation errors)

  • 🆕 Chinese-Russian (document-level without alignment information, may contain some translation errors)

TASK DESCRIPTION

The shared task will be the translation of web fiction texts in three directions: Chinese→English, Chinese→German, Chinese→Russian.

First, participants will be provided with three types of training data:

  • GuoFeng Webnovel Corpus v1: we release an in-domain, discourse-level and human-translated training dataset with sentence-level alignment for Chinese→English (same as last year).

  • GuoFeng Webnovel Corpus v2: we release two in-domain, discourse-level training datasets for Chinese→German and Chinese→Russian (new this year).

  • General MT Track Parallel Training Data: you can use all sentence-/document-level parallel training data of the general translation task (please go to General MT).

Second, we provide two types of validation/testing datasets:

  • Simple Set contains unseen chapters in the same web novels as the training data;

  • Difficult Set contains chapters in different web novels from the training data.

Third, we provide three types of in-domain pretrained models for Chinese→English (same as last year). You can also use the WMT-allowed general-domain LMs/LLMs:

  • Chinese-Llama-2 (7B): The Llama-2 model is continuously pretrained on 400GB of Chinese and English literary texts, and then finetuned on a Chinese instruction dataset (BAAI/COIG) and a Chinese-English document-level translation dataset, without changing the vocabulary.

  • In-domain RoBERTa (base): 12-layer encoder, hidden size 768, vocabulary size 21,128, whole-word masking. It was originally pretrained on Chinese Wikipedia. We continuously train it with Chinese literary texts (84B tokens).

  • In-domain mBART (CC25): 12-layer encoder and 12-layer decoder, hidden size 1024, vocabulary size 250,000. It was originally trained on a web corpus covering 25 languages. We continuously train it with English and Chinese literary texts (114B tokens).

  • General-domain LMs/LLMs: Llama-2-7B, Llama-2-13B, Mistral-7B; mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE (please go to The limitations for the constrained systems track).

In the final testing stage, participants use their systems to translate an official test set (a mix of the Simple and Difficult unseen test sets). Translation quality is measured as follows, and all systems will be ranked by human judgement or A/B testing according to our professional guidelines.

  • manual evaluation by human translators, e.g. MQM;

  • automatic evaluation metrics with two references;

  • A/B testing by web fiction readers (new this year).

The task also features a Constrained Track and an Unconstrained Track with different constraints on model training. Participants can submit either constrained or unconstrained systems with flags, and we will distinguish their submissions. For example, a Llama-2-7B model finetuned on the above data is a Constrained Track submission.

  • Constrained Track. You may ONLY use the training data specified above.

  • Unconstrained Track. Systems may be trained without any limitations.

DATA

Copyright is a crucial consideration when it comes to releasing literary texts, and we (Tencent AI Lab and China Literature Ltd.) are the rightful copyright owners of the web fictions included in this dataset. We are pleased to make this data available to the research community, subject to certain terms and conditions.

  • 🔔 GuoFeng Webnovel Corpus are copyrighted by Tencent AI Lab and China Literature Limited.

  • 🚦 After completing the registration process with your institute information, WMT participants or researchers are granted permission to use the dataset solely for non-commercial research purposes and must comply with the principles of fair use (CC-BY 4.0).

  • 🔒 Modifying or redistributing the dataset is strictly prohibited. If you plan to make any changes to the dataset, such as adding more annotations, with the intention of publishing it publicly, please contact us first to obtain written consent.

  • 🚧 By using this dataset, you agree to the terms and conditions outlined above. We take copyright infringement very seriously and will take legal action against any unauthorized use of our data.

Citation

📝 If you use the GuoFeng Webnovel Corpus, please cite the following papers and include the original download link:

Data Description (GuoFeng Webnovel Corpus V1)

💌 The web novels are originally written in Chinese by novel writers and then translated into English by professional translators. The processing steps are detailed in [1]. Note that (1) some sentences may have no aligned translations, because human translators translate novels at the document level; (2) we keep all document-level information, such as continuous chapters and sentences.

Chinese→English: We release 22,567 continuous chapters from 179 web novels, covering 14 genres such as fantasy, science fiction and romance. The data statistics are listed in Table 1.

Subset        | # Book | # Chapter | # Sentence | Notes
------------- | ------ | --------- | ---------- | --------------------------
Train         | 179    | 22,567    | 1,939,187  | covering 14 genres
Valid 1       | 22     | 22        | 755        | same books as Train
Test 1        | 26     | 22        | 697        | same books as Train
Valid 2       | 10     | 10        | 853        | different books from Train
Test 2        | 12     | 12        | 917        | different books from Train
Testing Input | —      | —         | —          | —

Data Format: Taking "train.en" for example, the data format is as follows: <BOOK id=""> </BOOK> marks a book boundary, which contains a number of continuous chapters with the tag <CHAPTER id=""> </CHAPTER>. The contents are split into sentences and manually aligned to Chinese sentences in "train.zh".

<BOOK id="100-jdxx">
<CHAPTER id="jdxx_0001">
Chapter 1 Make Your Choice, Youth
"Implode reality, pulverize thy spirit. By banishing this world, comply with the blood pact, I will summon forth thee, O' young Demon King!"
At a park during sunset, a childlike, handsome youth placed his left hand on his chest, while his right hand was stretched out with his fingers wide open, as though he was about to release something amazing from his palm. He looked serious and solemn.
... ...
</CHAPTER>
<CHAPTER id="jdxx_0002">
....
</CHAPTER>
</BOOK>
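The markup above is line-based rather than strict XML (sentences appear as bare lines between the tags), so a small hand-rolled parser is often simpler than an XML library. A minimal sketch in Python, assuming exactly the layout shown above (the function name and regexes are ours, not part of the official release):

```python
import re

def parse_guofeng(lines):
    """Parse GuoFeng-style markup into {book_id: {chapter_id: [sentences]}}.

    The format is line-based: <BOOK id="..."> and <CHAPTER id="..."> open
    a block, </BOOK> / </CHAPTER> close it, and every other non-empty
    line is one sentence (aligned line-by-line to "train.zh" in v1).
    """
    books, book_id, chap_id = {}, None, None
    for line in lines:
        line = line.strip()
        m = re.match(r'<BOOK id="(.+?)">$', line)
        if m:
            book_id = m.group(1)
            books[book_id] = {}
            continue
        m = re.match(r'<CHAPTER id="(.+?)">$', line)
        if m:
            chap_id = m.group(1)
            books[book_id][chap_id] = []
            continue
        if line in ("</CHAPTER>", "</BOOK>") or not line:
            continue  # closing tags and blank lines carry no content
        books[book_id][chap_id].append(line)
    return books

# Usage: books = parse_guofeng(open("train.en", encoding="utf-8"))
```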

Data Description (GuoFeng Webnovel Corpus V2)

💌 The web novels are originally written in Chinese by novel writers and then machine-translated into German/Russian with human post-editing. Note that (1) the data are document-level without alignment information; (2) the current version may contain some translation errors, e.g. mistranslations.

Chinese→German

Subset        | # Book | # Chapter | # Sentence | Notes
------------- | ------ | --------- | ---------- | ------------------
Train         | 145    | 599,647   | —          | covering 14 genres
Valid         | —      | —         | —          | —
Test          | —      | —         | —          | —
Testing Input | —      | —         | —          | —

Chinese→Russian (TBA)

Pretrained Models

We provide three types of in-domain pretrained models (same as last year), including a large language model (new this year):

Version            | Layer           | Hidden Size | Vocabulary Size | Continuous Training Data
------------------ | --------------- | ----------- | --------------- | ------------------------------------------------
Chinese-Llama-2 7B | 32              | 4,096       | 32,000          | Chinese and English literary texts (115B tokens)
RoBERTa base       | 12 enc          | 768         | 21,128          | Chinese literary texts (84B tokens)
mBART CC25         | 12 enc + 12 dec | 1,024       | 250,000         | English and Chinese literary texts (114B tokens)

EVALUATION METHODS

  • 🤖 Automatic Evaluation: To evaluate the performance of the submitted systems, we will report multiple evaluation metrics, including d-BLEU (document-level sacreBLEU) and d-COMET (document-level COMET), to measure the overall accuracy and fluency of the translations.

  • 👩‍🏫 Human Evaluation: In addition, professional translators assess the translations against more subjective criteria, such as the preservation of literary style and the overall coherence and cohesiveness of the translated texts. Based on our experience with this project, we designed a fine-grained error typology and MQM-based marking criteria for literary MT.

  • 👨‍👩‍👧‍👦 A/B Testing: Acknowledging the concern that there is no single, universally preferred translation for literary texts, we ask human readers or LLMs to select their preferred contents in practical application scenarios.
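d-BLEU can be reproduced by concatenating each document's sentences into a single line and scoring those lines with ordinary sacreBLEU, so that cross-sentence n-grams are credited. A minimal sketch in Python (the helper name `to_document_level` is ours; the scoring step itself requires the `sacrebleu` package):

```python
def to_document_level(sentences, doc_ids):
    """Concatenate sentence-level outputs into one line per document.

    `sentences` and `doc_ids` are parallel lists; the result preserves
    the first-occurrence order of the document IDs.
    """
    docs = {}
    for sent, doc in zip(sentences, doc_ids):
        docs.setdefault(doc, []).append(sent)
    return [" ".join(docs[d]) for d in dict.fromkeys(doc_ids)]

# Scoring (requires the sacrebleu package):
# import sacrebleu
# score = sacrebleu.corpus_bleu(to_document_level(hyps, ids),
#                               [to_document_level(refs, ids)])
```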

RESULTS SUBMISSION

  • Participants can submit either constrained or unconstrained systems with flags, and we will distinguish their submissions.

  • Each team can submit at most 3 MT outputs per language pair direction, one primary and up to two contrastive.

  • Submissions will be done by sending us an email to [Longyue Wang](mailto:vincentwang0229@gmail.com).

  • The submission format requirements are: (1) provide N output files identical in structure to the testing input files; (2) in the output files, ensure that each line is aligned with the corresponding input line; if a particular input line is blank, the corresponding output line must also be blank.

Subject: WMT2024 Literary Translation Submission (Team Name)
Basic Information: your team name, affiliations, team member names.
System Flag: constrained or unconstrained / zh-en or zh-de or zh-ru.
System Description: main techniques and toolkits used in your three submission systems.
Attachment: File names associated with testing input IDs (primary-1.en-zh.out, primary-2.en-zh.out, ..., contrastive1-1.en-zh.out, ..., contrastive2-12.en-zh.out)
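The line-alignment requirement above is easy to violate silently, so it is worth checking each output file before emailing a submission. A minimal sketch in Python (the helper name `check_alignment` is ours, not part of the official process):

```python
def check_alignment(src_lines, out_lines):
    """Check that an output file line-aligns with the testing input:
    equal line counts, and blank input lines stay blank in the output.
    Returns (ok, message)."""
    if len(src_lines) != len(out_lines):
        return False, f"line count differs: {len(src_lines)} vs {len(out_lines)}"
    for i, (s, o) in enumerate(zip(src_lines, out_lines), 1):
        if not s.strip() and o.strip():
            return False, f"line {i}: input is blank but output is not"
    return True, "ok"

# Usage:
# src = open("test_input.zh", encoding="utf-8").read().splitlines()
# out = open("output.txt", encoding="utf-8").read().splitlines()
# ok, msg = check_alignment(src, out)
```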

COMMITTEE

Organization Team

Evaluation Team

Advisory Committee

Contact

If you have any further questions or suggestions, please do not hesitate to send an email to Longyue Wang (vincentwang0229@gmail.com).

SPONSORS
