TASK DESCRIPTION
This task focuses on lexical choice in machine translation, in particular the handling of words that are repeated within a source sentence. In English, repeating the same word can create a monotonous or awkward impression, so it should be avoided where appropriate. Typical workarounds in monolingual writing are to (1) remove redundant terms if possible (reduction) or (2) use alternative words such as synonyms as substitutes (substitution). These techniques are also observed in human translations. Here are some examples, none of which are contained in the test set of this task:
Type | Japanese | Consistent translation (Ja→En) | English (Original) | Description |
---|---|---|---|---|
Reduction | 耐震化を済ませていない494団体に今後の対応を尋ねたところ、改修するのは70団体、建て替えは265団体、移転が11団体だった。 | When the 494 organizations that had not yet completed earthquake proofing were asked about their future measures, 70 organizations opted for retrofitting, 265 chose rebuilding, and 11 selected relocation. | Of the 494 unprepared municipalities, 70 are set to carry out repairs, 265 will construct new buildings and 11 are planning relocation. | Ellipsis: In the original English sentence, a noun ellipsis occurs, e.g., "70 municipalities" is expressed as "70." |
Reduction | 開発費を参加国間で分担できるため、国産開発に比べて費用を安く抑えることが可能となる。 | Since development expenses can be shared among participating countries, it will be possible to keep costs lower than domestic development. | It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries. | Semantic pleonasm: "costs" is used instead of "development costs" in the original English sentence probably because it is contextually inferable. |
Reduction | 同社はニューヨーク州のヨンカース工場と中西部ネブラスカ州のリンカーン工場で車両の製造や試験を行う。 | The company will manufacture and test vehicles at its Yonkers, New York, factory and its Lincoln, Nebraska, factory in the Midwest. | Kawasaki Rail Car will build and test the subway cars at its facilities in Yonkers and in Lincoln, Nebraska. | Sharing heads: The two nouns ("facility") are merged into one and the noun head is shared by the two prepositional phrases. Although strictly they are not reduced, we also consider these examples to be a type of reduction. |
Substitution | 農作物への影響が心配されるが、農林水産省は「(首都圏などでは)積雪が長引かなかったので大きな影響はない」(園芸作物課)とみている。 | There are concerns about the impact on crops, but an official at the Horticultural Crops Division of the Ministry of Agriculture, Forestry and Fisheries (MAFF) said, "the snowfall (in the Tokyo metropolitan area and other regions) was not prolonged, so there will be no major impact." | Although many people are worried about the effects of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant impact, as the snow did not stay for long in areas such as the Tokyo metropolitan area. | Synonym: Words with similar meaning are typically used for substitution. |
Substitution | 物質を構成する素粒子の振る舞いは「標準理論」で説明されるが、宇宙の全質量の4分の1を占める「暗黒物質」など説明できない部分もある。 | The Standard theory explains the behavior of elementary particles, which make up matter, but it cannot explain some things, such as dark matter, which makes up one quarter of the mass of the universe. | The so-called Standard Model explains the behavior of elementary particles, the fundamental building blocks of matter. But the theory leaves some mysteries, such as dark matter which is thought to make up about a quarter of the mass of the universe. | Non-literal translation: Repeated words are sometimes translated in a non-literal manner. |
Substitution | 当時、テニス部の生徒6人とコーチがコートで練習をしており、生徒の1人がボールを拾おうとしたところ、隣のコートにパラシュート状の物があることに気付いたという。 | At the time, six students and the coach from the tennis club were reportedly practicing on the court when one of the students went to pick up a ball and noticed a parachute-like object on the adjacent court. | At the time, the student was practicing tennis with five other students and one coach at another court next to the one where the parachute was found. | Pronouns/Pro-verbs: Repeated words are sometimes substituted with pronouns or pro-verbs, such as "it" and "do so." |
The goal of this task is to study how these techniques can be incorporated into machine translation systems to enrich their lexical choice capabilities. From a practical standpoint, such a capability would be important in, for example, news production, which requires high-quality text that goes beyond robotic word-by-word translation.
Specifically, participants are required to control a machine translation system using reduction or substitution so that it does not output the same words for certain repeated words in a source sentence. The translation direction is Japanese to English.
CHALLENGES
The challenges underlying this task include the following:
- Maintaining the balance between translation quality and controlling the output: The translation quality can be degraded when the non-repetitive style is inappropriately enforced.
- Avoiding bias toward high-frequency bilingual word pairs: In general, for a given source word, high-frequency target words associated with it are more likely to be output. This can make it difficult to determine appropriate substitutions for some words.
- Predicting which words can be reduced or substituted: Predicting which source words can be reduced or substituted appropriately is not an easy problem because it depends on the context within the sentence.
- Mining training instances: Translations with reduction can be especially difficult to identify in noisy corpora because of the challenge of discriminating them from undertranslations.
DATA SET
We provide development and test sets for this task, which are referred to as the Jiji 2023 data and the Jiji 2024 data, respectively, in this description. In both data sets, all Japanese sentences contain some repeated words that are translated into English with reduction or substitution. We collected these data from Jiji Japanese–English news articles. Specifically, we first automatically created sentence pairs based on lexical similarities, and then manually selected instances suited for this task. These sentence pairs include not only one-to-one pairs but also two-to-two pairs. Both the development and test sets contain raw and tagged parallel data. In the tagged data, repeated words in the source sentence and their counterparts in the target sentence are marked with tags, which indicate that these words are evaluation targets. Examples are as follows:
File Type | Japanese | English |
---|---|---|
Raw | 開発費を参加国間で分担できるため、国産開発に比べて費用を安く抑えることが可能となる。 | It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries. |
Tagged | <target>開発<\target>費を参加国間で分担できるため、国産<target>開発<\target>に比べて費用を安く抑えることが可能となる。 | It will allow the government to cut spending compared with full domestic <target>development<\target> by sharing costs with partner countries. |
Raw | 農作物への影響が心配されるが、農林水産省は「(首都圏などでは)積雪が長引かなかったので大きな影響はない」(園芸作物課)とみている。 | Although many people are worried about the effects of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant impact, as the snow did not stay for long in areas such as the Tokyo metropolitan area. |
Tagged | 農作物への<target>影響<\target>が心配されるが、農林水産省は「(首都圏などでは)積雪が長引かなかったので大きな<target>影響<\target>はない」(園芸作物課)とみている。 | Although many people are worried about <target>the effects<\target> of harsh cold on crops, an official of Japan’s agricultural ministry predicted that there will be no significant <target>impact<\target>, as the snow did not stay for long in areas such as the Tokyo metropolitan area. |
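When working with the tagged development files, it can help to pull out the evaluation targets and to recover the untagged text. The following is a minimal sketch, assuming the tags appear exactly as in the examples above; the function names `extract_targets` and `strip_tags` are hypothetical and not part of any official tooling.

```python
import re

# The examples above close a tag with a backslash (<\target>). Accepting a
# forward slash as well is an assumption made here purely for robustness.
TARGET_RE = re.compile(r"<target>(.*?)<[\\/]target>")


def extract_targets(tagged_line):
    """Return the tagged spans of one line, in order of appearance."""
    return TARGET_RE.findall(tagged_line)


def strip_tags(tagged_line):
    """Drop the tags but keep the text they enclose."""
    return TARGET_RE.sub(r"\1", tagged_line)


# Raw strings keep the backslash in <\target> from being read as an escape.
ja = r"<target>開発<\target>費を参加国間で分担できるため、国産<target>開発<\target>に比べて費用を安く抑えることが可能となる。"
en = r"It will allow the government to cut spending compared with full domestic <target>development<\target> by sharing costs with partner countries."

print(extract_targets(ja))  # two source occurrences of the repeated word
print(extract_targets(en))  # one target occurrence: the other was reduced
print(strip_tags(en))
```

Comparing the number of tagged spans on each side gives a rough signal of the technique used: fewer spans on the English side than on the Japanese side suggests reduction, as in this example.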
File name | The number of parallel sentences |
---|---|
wat2023.devtest.raw.en | 162 |
wat2023.devtest.raw.ja | 162 |
wat2023.devtest.tagged.en | 162 |
wat2023.devtest.tagged.ja | 162 |
File name | The number of parallel sentences |
---|---|
wmt2024.test.raw.en | 470 |
wmt2024.test.raw.ja | 470 |
wmt2024.test.tagged.en | 470 |
wmt2024.test.tagged.ja | 470 |
Note that not all words repeated in the source sentence are evaluation targets. This is because some words, such as proper nouns and technical terms, should be translated consistently, even if they are repeated in the sentence.
Tagged development data are provided to help tune the model during training. However, participants cannot use tagged test data and must use raw test data when submitting the system results. In this task, the systems must detect repeated words which can be reduced or substituted on their own.
To reduce the negative effects of imbalanced content in the source and target sentences, the Japanese sentences in the Jiji 2023 and 2024 data were manually translated from the English while preserving as much of the vocabulary of the original Japanese sentences as possible.
For training, we also provide all the data from the WAT 2020 Newswire tasks, which were likewise constructed from Jiji news articles and have been used in WAT continuously since 2020. For simplicity, we refer to these data as the Jiji 2020 data. The main files in the Jiji 2020 data are as follows:
File name | The number of parallel sentences |
---|---|
train-sim.txt | 200K with lexical similarity scores |
devc.txt | 479 |
testc.txt | 1851 |
The above data are a regular parallel corpus and have not been annotated specifically for this task, but are in exactly the same domain as this task. Although "devc.txt" and "testc.txt" are not directly related to the evaluation of this task, these can be used to measure basic translation performance during training.
In addition, participants can use any other publicly available corpora, such as the data from the general MT task in WMT, for training. When using external data, be sure to describe the data in your paper.
To obtain the train (Jiji 2020), development (Jiji 2023) and test (Jiji 2024) data, follow these steps:
- Complete and sign the license agreement: English or Japanese. Please read the license agreement carefully.
- Scan and email the signed agreement to Jiji Press Ltd. (asaka@jiji.co.jp), and send the original copy of the agreement to the following address:
  English:
  ASAKA, Hidehiro
  Sports Business Promotion Office
  JIJI Press LTD.
  5-15-8 Ginza, Chuo-ku,
  Tokyo 104-8178, JAPAN
  Japanese:
  104-8178
  東京都中央区銀座5-15-8
  時事通信社スポーツ事業推進室
  朝賀英裕
- The organizers will email the link to download these corpora to the applicant, once Jiji Press Ltd. has received the original copy and approved the application. Please note Jiji Press Ltd. will provide the e-mail address of the applicant to the organizers.
Please anonymize any personal information when you include such text from the Jiji data in your papers and presentations.
EVALUATION
System performance is evaluated using the total number of outputs that meet both acceptable translation adequacy and appropriate lexical choice. Both dimensions will be checked by human annotators, who will be assigned by the organizers.
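As an illustration of how this count combines the two dimensions, here is a minimal sketch. The `Judgment` record and its field names are hypothetical; the actual judgments come from the human annotators, and their recording format is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    adequate: bool    # translation adequacy judged acceptable
    lexical_ok: bool  # lexical choice for the tagged words judged appropriate


def score(judgments):
    """Count the outputs that pass BOTH checks."""
    return sum(1 for j in judgments if j.adequate and j.lexical_ok)


judgments = [
    Judgment(adequate=True, lexical_ok=True),
    Judgment(adequate=True, lexical_ok=False),  # adequate but repetitive
    Judgment(adequate=False, lexical_ok=True),  # non-repetitive but inadequate
    Judgment(adequate=True, lexical_ok=True),
]
print(score(judgments))  # 2
```

An output that avoids repetition at the cost of adequacy does not count, which is why the balance described under CHALLENGES matters.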
For the lexical choice evaluation, the translations of the tagged source words are checked. (Again, the tags must be hidden from the system during inference.) Whether untagged repeated words are translated repetitively or non-repetitively does not affect the lexical choice evaluation. The technique used (reduction or substitution) does not have to match that of the reference translation: in our preliminary investigations, we qualitatively studied the lexical choices of several translators and observed cases where one translator chose substitution and another chose reduction. Likewise, systems do not have to choose the same words as the reference, as long as the meaning is appropriate.
Whether a pair of translations counts as substitution or repetition is determined primarily by the word stem. For example, conversions between voice (e.g., "attack" and "be attacked"), tense (e.g., "study" and "studied") and parts of speech (e.g., "problematic" and "problem") are not considered to be substitutions. Conversions to idioms (e.g., "visit" and "pay a visit") are an exception and are handled as substitutions.
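For a rough self-check during development, the stem criterion can be approximated automatically. The sketch below is an assumption-laden stand-in, not the organizers' procedure: it uses a crude shared-prefix heuristic in place of a real stemmer, the names `same_stem` and `classify` are hypothetical, and it compares single words only, so it cannot capture the idiom exception.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def same_stem(a, b, min_len=4):
    """Crude stem test: the shared prefix must be at least min_len characters
    and cover at least 80% of the shorter word (integer arithmetic)."""
    a, b = a.lower(), b.lower()
    shared = common_prefix_len(a, b)
    shorter = min(len(a), len(b))
    return shared >= min_len and 5 * shared >= 4 * shorter


def classify(first, second):
    """Label how the second occurrence relates to the first.
    Pass second=None when the second occurrence was dropped."""
    if second is None:
        return "reduction"
    return "repetition" if same_stem(first, second) else "substitution"


print(classify("attack", "attacked"))      # repetition (voice/tense change)
print(classify("problematic", "problem"))  # repetition (part of speech)
print(classify("effects", "impact"))       # substitution
print(classify("development", None))       # reduction
```

A prefix heuristic of this kind will misjudge unrelated words that happen to share a long prefix, so its output is only a hint; the human judgment remains decisive.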
In addition, BLEU scores and style estimations (i.e., non-repetitive or repetitive) obtained with a word aligner are reported for reference.
IMPORTANT DATES
Release of the training data (the Jiji 2020 data) | April 17, 2024 |
Release of the development data (the Jiji 2023 data) | May 23, 2024 |
Release of the test data (the Jiji 2024 data) | July 15-16, 2024 |
Submission deadline for the task | July 21, 2024 |
Paper submission deadline | TBA around August 20 (following EMNLP/WMT) |
Notification of acceptance | TBA (following EMNLP/WMT) |
Camera-ready deadline | TBA around October 3 (following EMNLP/WMT) |
Conference | November 15-16, 2024 |
SUBMISSION FORMAT
Please send your system translation as a file named "YOUR_TEAM_NAME.txt" by email to kinugawa.k-jg@nhk.or.jp and mino.h-gq@nhk.or.jp.
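The internal layout of the file is not spelled out here. The sketch below assumes one translation per line, in the same order as the raw test source, encoded in UTF-8; these are assumptions, and the helper names `format_submission` and `write_submission` are hypothetical.

```python
from pathlib import Path

EXPECTED_LINES = 470  # number of test sentences listed for the wmt2024.test.* files


def format_submission(translations, expected=EXPECTED_LINES):
    """Return the file contents: one translation per line (assumed layout)."""
    if len(translations) != expected:
        raise ValueError(f"expected {expected} translations, got {len(translations)}")
    # Collapse internal whitespace, including stray newlines, so that each
    # translation occupies exactly one line.
    lines = [" ".join(t.split()) for t in translations]
    return "\n".join(lines) + "\n"


def write_submission(translations, team_name, out_dir=".", expected=EXPECTED_LINES):
    """Write <team_name>.txt in UTF-8 and return its path."""
    path = Path(out_dir) / f"{team_name}.txt"
    path.write_text(format_submission(translations, expected), encoding="utf-8")
    return path


# Small demonstration with two dummy outputs instead of the full test set.
demo = format_submission(["First  output.", "Second\noutput."], expected=2)
print(demo, end="")
```

Checking the line count before writing guards against a silent misalignment between outputs and source sentences, for example when a system emits an empty or multi-line translation.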
ORGANIZERS
- Kazutaka Kinugawa (kinugawa.k-jg@nhk.or.jp), NHK
- Hideya Mino (mino.h-gq@nhk.or.jp), NHK
- Naoto Shirai (shirai.n-hk@nhk.or.jp), NHK
- Isao Goto (goto.isao.fn@ehime-u.ac.jp), Ehime University
ACKNOWLEDGEMENTS
We are deeply grateful to Hidehiro Asaka and Takayuki Kawakami for providing the valuable data used in this research.
These research results were obtained from the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan.
CHANGE LOG
- 2024-07-18: Modify the number of test sentences in Table 4
- 2024-07-15: Update the test data release date
- 2024-07-05: Update the submission format
- 2024-07-04: Fix typo in Table 1
- 2024-07-04: Update the test data release date to July 14
- 2024-06-14: Update submission deadline for the task
- 2024-06-05: Add data statistics of dev and test data
- 2024-06-05: Add the explanation about sentence pairs
- 2024-06-05: Add the note for using external data
- 2024-06-05: Clarify the names of the training (Jiji 2020), dev (Jiji 2023) and test (Jiji 2024) data
- 2024-05-23: The development data (Jiji 2023) released
- 2024-05-02: Site opened
COPYRIGHT
Copyright©Jiji Press, Ltd. All rights reserved for the example sentences in this document.