Non-Repetitive Translation Task

EMNLP 2024

NINTH CONFERENCE ON
MACHINE TRANSLATION (WMT24)

November 15-16, 2024
Miami, Florida, USA

TASK DESCRIPTION

This task focuses on lexical choice in machine translation, in particular the handling of words that are repeated in a source sentence. In English, repeating the same word can create a monotonous or awkward impression, so such repetition is generally avoided where appropriate. Typical workarounds in monolingual writing are to (1) remove redundant terms if possible (reduction) or (2) use alternative words such as synonyms as substitutes (substitution). These techniques are also observed in human translations. Here are some examples, none of which are contained in the test set of this task:

Table 1. Examples of translations with reduction and substitution from Jiji Japanese–English news articles. For comparison, the consistent translations are also included.
Type	Japanese	Consistent translation (Ja→En)	English (Original)	Description
Reduction	耐震化を済ませていない４９４団体に今後の対応を尋ねたところ、改修するのは７０団体、建て替えは２６５団体、移転が１１団体だった。	When the 494 organizations that had not yet completed earthquake proofing were asked about their future measures, 70 organizations opted for retrofitting, 265 chose rebuilding, and 11 selected relocation.	Of the 494 unprepared municipalities, 70 are set to carry out repairs, 265 will construct new buildings and 11 are planning relocation.	Ellipsis: In the original English sentence, a noun ellipsis occurs, e.g., "70 municipalities" is expressed as "70."
	開発費を参加国間で分担できるため、国産開発に比べて費用を安く抑えることが可能となる。	Since development expenses can be shared among participating countries, it will be possible to keep costs lower than domestic development.	It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries.	Semantic pleonasm: "costs" is used instead of "development costs" in the original English sentence probably because it is contextually inferable.
	同社はニューヨーク州のヨンカース工場と中西部ネブラスカ州のリンカーン工場で車両の製造や試験を行う。	The company will manufacture and test vehicles at its Yonkers, New York, factory and its Lincoln, Nebraska, factory in the Midwest.	Kawasaki Rail Car will build and test the subway cars at its facilities in Yonkers and in Lincoln, Nebraska.	Sharing heads: The two nouns ("facility") are merged into one and the noun head is shared by the two prepositional phrases. Although strictly they are not reduced, we also consider these examples to be a type of reduction.
Substitution	農作物への影響が心配されるが、農林水産省は「（首都圏などでは）積雪が長引かなかったので大きな影響はない」（園芸作物課）とみている。	There are concerns about the impact on crops, but an official at the Horticultural Crops Division of the Ministry of Agriculture, Forestry and Fisheries (MAFF) said, "the snowfall (in the Tokyo metropolitan area and other regions) was not prolonged, so there will be no major impact."	Although many people are worried about the effects of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant impact, as the snow did not stay for long in areas such as the Tokyo metropolitan area.	Synonym: Words with similar meaning are typically used for substitution.
	物質を構成する素粒子の振る舞いは「標準理論」で説明されるが、宇宙の全質量の４分の１を占める「暗黒物質」など説明できない部分もある。	The Standard theory explains the behavior of elementary particles, which make up matter, but it cannot explain some things, such as dark matter, which makes up one quarter of the mass of the universe.	The so-called Standard Model explains the behavior of elementary particles, the fundamental building blocks of matter. But the theory leaves some mysteries, such as dark matter which is thought to make up about a quarter of the mass of the universe.	Non-literal translation: Repeated words are sometimes translated in a non-literal manner.
	当時、テニス部の生徒６人とコーチがコートで練習をしており、生徒の１人がボールを拾おうとしたところ、隣のコートにパラシュート状の物があることに気付いたという。	At the time, six students and the coach from the tennis club were reportedly practicing on the court when one of the students went to pick up a ball and noticed a parachute-like object on the adjacent court.	At the time, the student was practicing tennis with five other students and one coach at another court next to the one where the parachute was found.	Pronouns/Pro-verbs: Repeated words are sometimes substituted with pronouns or pro-verbs, such as "it" and "do so."

The goal of this task is to study how these techniques can be incorporated into machine translation systems to enrich lexical choice capabilities. From a practical standpoint, such capability would be important, for example, in news production, where high quality text that goes beyond robotic word-by-word translation is required.

Specifically, participants are required to control a machine translation system using reduction or substitution so that it does not output the same words for certain repeated words in a source sentence. The translation direction is Japanese to English.

CHALLENGES

The challenges underlying this task include the following:

Maintaining the balance between translation quality and controlling the output: The translation quality can be degraded when the non-repetitive style is inappropriately enforced.
Avoiding bias toward high-frequency bilingual word pairs: In general, for a given source word, high-frequency target words associated with it are more likely to be output. This can make it difficult to determine appropriate substitutions for some words.
Predicting which words can be reduced or substituted: Predicting which source words can be reduced or substituted appropriately is not an easy problem because it depends on the context within the sentence.
Mining training instances: Translations with reduction can be especially difficult to identify in noisy corpora because of the challenge of discriminating them from undertranslations.

DATA SET

We provide development and test sets for this task, which are referred to as the Jiji 2023 data and the Jiji 2024 data, respectively, in this description. In both data sets, all Japanese sentences contain some repeated words that are translated into English with reduction or substitution. We collected these data from Jiji Japanese–English news articles. Specifically, we first automatically created sentence pairs based on lexical similarities, and then manually selected instances suited for this task. These sentence pairs include not only one-to-one pairs but also two-to-two pairs. Both the development and test sets contain raw and tagged parallel data. In the tagged data, repeated words in the source sentence and their counterparts in the target sentence are marked with tags, which indicate that these words are evaluation targets. Examples are as follows:

Table 2. Examples of raw and tagged parallel sentences.
File Type	Japanese	English
Raw	開発費を参加国間で分担できるため、国産開発に比べて費用を安く抑えることが可能となる。	It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries.
Tagged	<target>開発<\target>費を参加国間で分担できるため、国産<target>開発<\target>に比べて費用を安く抑えることが可能となる。	It will allow the government to cut spending compared with full domestic <target>development<\target> by sharing costs with partner countries.
Raw	農作物への影響が心配されるが、農林水産省は「（首都圏などでは）積雪が長引かなかったので大きな影響はない」（園芸作物課）とみている。	Although many people are worried about the effects of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant impact, as the snow did not stay for long in areas such as the Tokyo metropolitan area.
Tagged	農作物への<target>影響<\target>が心配されるが、農林水産省は「（首都圏などでは）積雪が長引かなかったので大きな<target>影響<\target>はない」（園芸作物課）とみている。	Although many people are worried about <target>the effects<\target> of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant <target>impact<\target>, as the snow did not stay for long in areas such as the Tokyo metropolitan area.
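The tagged format in Table 2 can be read with a short script. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the pattern assumes that the tags appear in the files exactly as shown on this page (closing tag written as "<\target>"); a closing tag written with a forward slash is also accepted in case the files differ.

```python
import re

# One tagged span. The closing tag is written "<\target>" on this page;
# "</target>" is accepted as well (assumption: the files may use either).
TARGET_SPAN = re.compile(r"<target>(.*?)<[\\/]target>")


def extract_targets(line):
    """Return the text of every tagged span, in order of appearance."""
    return TARGET_SPAN.findall(line)


def strip_tags(line):
    """Remove the tags but keep the text they enclose."""
    return TARGET_SPAN.sub(r"\1", line)


# Example taken from Table 2 (raw strings keep the backslash literal).
ja = r"<target>開発<\target>費を参加国間で分担できるため、国産<target>開発<\target>に比べて費用を安く抑えることが可能となる。"
en = r"It will allow the government to cut spending compared with full domestic <target>development<\target> by sharing costs with partner countries."

ja_targets = extract_targets(ja)  # two source occurrences of the repeated word
en_targets = extract_targets(en)  # one counterpart: the other was reduced
en_raw = strip_tags(en)           # should equal the corresponding raw line
```

In this example the source has two tagged occurrences and the target has one, which corresponds to a reduction.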

Table 3. Data statistics of the development data. (The Jiji 2023 data.)
File name	The number of parallel sentences
wat2023.devtest.raw.en	162
wat2023.devtest.raw.ja	162
wat2023.devtest.tagged.en	162
wat2023.devtest.tagged.ja	162

Table 4. Data statistics of the test data. (The Jiji 2024 data.)
File name	The number of parallel sentences
wmt2024.test.raw.en	470
wmt2024.test.raw.ja	470
wmt2024.test.tagged.en	470
wmt2024.test.tagged.ja	470

Note that not all words repeated in the source sentence are evaluation targets. This is because some words, such as proper nouns and technical terms, should be translated consistently, even if they are repeated in the sentence.

Tagged development data are provided to help tune the model during training. However, participants cannot use the tagged test data and must use the raw test data when submitting system results. In this task, the systems must detect, on their own, the repeated words that can be reduced or substituted.
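As a starting point for detection on raw input, repeated strings in a Japanese sentence can be listed without any tags. The sketch below is only a naive baseline under stated assumptions: it looks at runs of kanji instead of using a morphological analyzer, the function name and the minimum length of two characters are hypothetical choices, and it over-generates, because it cannot tell which repeated words should be kept consistent (such as proper nouns and technical terms).

```python
import re

# Runs of kanji in the CJK Unified Ideographs block (assumption: candidate
# words are approximated by kanji substrings, with no real tokenization).
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def repeated_candidates(sentence, min_len=2):
    """List kanji substrings that occur at least twice in the sentence."""
    candidates = set()
    for run in KANJI_RUN.findall(sentence):
        for i in range(len(run)):
            for j in range(i + min_len, len(run) + 1):
                sub = run[i:j]
                if sentence.count(sub) >= 2:
                    candidates.add(sub)
    # Keep only maximal candidates: drop a string when a longer candidate
    # that contains it is repeated just as often.
    maximal = [
        c for c in candidates
        if not any(
            c != d and c in d and sentence.count(d) == sentence.count(c)
            for d in candidates
        )
    ]
    # Order the result by first position in the sentence.
    return sorted(maximal, key=sentence.find)


source = "開発費を参加国間で分担できるため、国産開発に比べて費用を安く抑えることが可能となる。"
found = repeated_candidates(source)
```

A practical system would still need to decide which of these candidates can actually be reduced or substituted in context, which is the hard part of the task.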

To reduce the negative effects of imbalanced content in the source and target sentences, the Japanese sentences in the Jiji 2023 and 2024 data were manually translated from the English while preserving as much of the vocabulary of the original Japanese sentences as possible.

As training data, we provide all the data from the WAT 2020 Newswire tasks, which were also constructed from Jiji news articles and have been used in WAT since 2020. For simplicity, we refer to these data as the Jiji 2020 data. The main files in the Jiji 2020 data are as follows:

Table 5. Data statistics of the training data. (The Jiji 2020 data.)
File name	The number of parallel sentences
train-sim.txt	200K with lexical similarity scores
devc.txt	479
testc.txt	1851

The above data are a regular parallel corpus and have not been annotated specifically for this task, but are in exactly the same domain as this task. Although "devc.txt" and "testc.txt" are not directly related to the evaluation of this task, these can be used to measure basic translation performance during training.

In addition, participants can also use any other publicly available corpora, such as the data from the general MT task in WMT, for training. When using external data, be sure to include an explanation about the data in your paper.

To obtain the train (Jiji 2020), development (Jiji 2023) and test (Jiji 2024) data, follow these steps:

Complete and sign the license agreement: English or Japanese. Please read the license agreement carefully.
Scan and email the signed agreement to Jiji Press Ltd. (asaka@jiji.co.jp), and send the original copy of the agreement to the following address:

English:
ASAKA, Hidehiro
Sports Business Promotion Office
JIJI Press LTD.
5-15-8 Ginza, Chuo-ku,
Tokyo 104-8178, JAPAN

Japanese:
104-8178
東京都中央区銀座5-15-8
時事通信社スポーツ事業推進室
朝賀英裕
The organizers will email the link to download these corpora to the applicant, once Jiji Press Ltd. has received the original copy and approved the application. Please note Jiji Press Ltd. will provide the e-mail address of the applicant to the organizers.

Please anonymize any personal information when you include such text from the Jiji data in your papers and presentations.

EVALUATION

System performance is evaluated using the total number of outputs that meet both acceptable translation adequacy and appropriate lexical choice. Both dimensions will be checked by human annotators, who will be assigned by the organizers.
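The score described above is a simple count over human judgments. The following minimal sketch illustrates it; the class and field names are hypothetical, and the judgments themselves come from the annotators, not from any automatic procedure.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Human judgment for one system output (hypothetical structure)."""
    adequate: bool           # translation adequacy is acceptable
    lexical_choice_ok: bool  # lexical choice for the tagged words is appropriate


def task_score(judgments):
    """Count the outputs that satisfy both criteria at the same time."""
    return sum(1 for j in judgments if j.adequate and j.lexical_choice_ok)


example = [
    Judgment(adequate=True, lexical_choice_ok=True),
    Judgment(adequate=True, lexical_choice_ok=False),   # repetitive output
    Judgment(adequate=False, lexical_choice_ok=True),   # quality degraded
    Judgment(adequate=True, lexical_choice_ok=True),
]
score = task_score(example)
```

An output that avoids repetition but loses adequacy does not count, which reflects the balance between translation quality and control described under CHALLENGES.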

Regarding the evaluation for lexical choice, the translations for the tagged source words are checked. (Again, the tags must be hidden from the system during inference.) Whether untagged repeated words are translated in a repetitive or non-repetitive way does not affect the lexical choice evaluation. Here, the technique (reduction or substitution) does not have to be consistent with that of the reference translation. In our preliminary investigations, we qualitatively studied the lexical choices of several translators, and observed cases where one translator chose substitution and another chose reduction. In addition, the systems do not have to choose the same words used in the reference, as long as the meaning is appropriate.

The determination of substitution or repetition is based primarily on the word stem. For example, conversions between voice (e.g., "attack" and "be attacked"), tense (e.g., "study" and "studied") and parts of speech (e.g., "problematic" and "problem") are not considered to be substitutions. Conversions to idioms (e.g., "visit" and "pay a visit") are an exception and are handled as substitutions.
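The stem-based rule can be illustrated with a rough heuristic. This is a sketch under stated assumptions, not the procedure used by the annotators: the suffix list, the small set of ignored function words, and the convention that an empty string stands for an omitted counterpart are all hypothetical, and a real stemmer or lemmatizer would be more reliable.

```python
# Function words ignored when comparing phrases (hypothetical, incomplete list).
IGNORED = {"be", "is", "are", "was", "were", "been", "being", "the", "a", "an"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    w = word.lower()
    for suffix in ("atic", "ied", "ing", "ed", "es", "s", "y"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def content_stems(phrase):
    """Set of stems of the words in a phrase, ignoring function words."""
    return {crude_stem(w) for w in phrase.lower().split() if w not in IGNORED}


def classify(first, second):
    """Label a pair of target-side counterparts of one repeated source word."""
    if not first or not second:
        return "reduction"      # one counterpart was omitted
    if content_stems(first) == content_stems(second):
        return "repetition"     # same stem: not counted as substitution
    return "substitution"


labels = {
    ("attack", "be attacked"): classify("attack", "be attacked"),
    ("study", "studied"): classify("study", "studied"),
    ("problematic", "problem"): classify("problematic", "problem"),
    ("visit", "pay a visit"): classify("visit", "pay a visit"),
    ("the effects", "impact"): classify("the effects", "impact"),
    ("development", ""): classify("development", ""),
}
```

With these assumptions, the voice, tense and part-of-speech pairs come out as repetition, while the idiom pair and the synonym pair come out as substitution, matching the examples in the text.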

In addition, BLEU scores and style (i.e., non-repetitive or repetitive) estimations using a word aligner are also reported for reference.

IMPORTANT DATES

Release of the training data (the Jiji 2020 data)

April 17, 2024

Release of the development data (the Jiji 2023 data)

May 23, 2024

Release of the test data (the Jiji 2024 data)

July 15-16, 2024

Submission deadline for the task

July 21, 2024

Paper submission deadline

TBA around August 20 (following EMNLP/WMT)

Notification of acceptance

TBA (following EMNLP/WMT)

Camera-ready deadline

TBA around October 3 (following EMNLP/WMT)

Conference

November 15-16, 2024

SUBMISSION FORMAT

Please send your system translation as a file named "YOUR_TEAM_NAME.txt" by email to kinugawa.k-jg@nhk.or.jp and mino.h-gq@nhk.or.jp.

ORGANIZERS

Kazutaka Kinugawa (kinugawa.k-jg@nhk.or.jp), NHK
Hideya Mino (mino.h-gq@nhk.or.jp), NHK
Naoto Shirai (shirai.n-hk@nhk.or.jp), NHK
Isao Goto (goto.isao.fn@ehime-u.ac.jp), Ehime University

ACKNOWLEDGEMENTS

We are deeply grateful to Hidehiro Asaka and Takayuki Kawakami for providing the valuable data used in this research.
These research results were obtained from the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan.

CHANGE LOG

2024-07-18: Modify the number of test sentences in Table 4
2024-07-15: Update the test data release date
2024-07-05: Update the submission format
2024-07-04: Fix typo in Table 1
2024-07-04: Update the test data release date to July 14
2024-06-14: Update submission deadline for the task
2024-06-05: Add data statistics of dev and test data
2024-06-05: Add the explanation about sentence pairs
2024-06-05: Add the note for using external data
2024-06-05: Clarify the names of the training (Jiji 2020), dev (Jiji 2023) and test (Jiji 2024) data
2024-05-23: The development data (Jiji 2023) released
2024-05-02: Site opened
