Shared Task: Parallel Data Curation

We introduce a new shared task that aims to evaluate parallel data curation methods. The goal of the task is to find the best MT training data within a provided pile of web-crawled data. We encourage submissions that address any aspect of the problem: document alignment, sentence alignment, comparable corpora mining, bitext filtering, language ID, or related fields.

Announcements

  • 2023-06-29: Baseline NMT scripts and instructions released. The dataset was updated to fix a minor inconsistency in the language ID column (see the update note in the Data section below).

  • 2023-06-15: Dataset released (initial version).

Important dates

Organizers release data: June 15, 2023

Submissions deadline: September 1, 2023

System paper submission deadline: September 22, 2023

Organizers release final results: September 25, 2023

Camera-ready paper deadline: October 9, 2023

WMT Conference: December 6-7, 2023

All deadlines and release dates are in the Anywhere on Earth (AoE) time zone.

Overview/Motivation

A machine translation system is only as good as its training data. The web provides vast amounts of translated text that can be used as training data. The challenge is to find the pairs of sentences or documents that are translations of each other, which can then be used to train the best possible MT system.

For this shared task, the organizers will provide:

  1. Web-crawled data

  2. Intermediate outputs from the baseline, so that participants can focus on specific aspects of the task

  3. MT training and evaluation scripts

The participants' task is to find the best possible set of training data within the provided web-crawled data for training a downstream MT model, using the provided model training scripts. Downstream MT performance will be judged using automatic MT metrics.
This shared task builds on prior shared tasks on document alignment (WMT 16) and sentence filtering (WMT 18, 19, 20).
Participants may use only pre-trained models and datasets that were publicly released with a research-friendly license on or before May 1, 2023. All participants are required to submit a system description paper. Similar to the main track, systems that do not adhere to these rules can still be submitted, but participants must identify their systems as "unconstrained" at submission time, and the results will be denoted as such.

Data

We have chosen Estonian-Lithuanian for this shared task. The pair was selected to balance two goals: the available training data should be large enough to train a reasonable Estonian→Lithuanian MT model, yet small enough to keep the task accessible to participants with limited compute. For the same reason, we also release pre-computed intermediate steps from a baseline (e.g., LASER embeddings, sentence pairs from FAISS search), so participants can choose to focus on one aspect of the task (e.g., sentence filtering).

To create this dataset, the organizers did the following:

  1. Retrieve a single recent snapshot of CommonCrawl (2023-06).

  2. Extract plain text from HTML documents using trafilatura.

  3. Filter Estonian and Lithuanian documents: using the 176-language fastText language ID model on the first 2,000 characters of each document, we removed documents that are neither Estonian nor Lithuanian (see the sketch after this list).

  4. Remove unsafe and offensive content: based on the Blocklist Project, documents from hostnames present in the following lists were removed: abuse, basic, crypto, drugs, fraud, gambling, malware, phishing, piracy, porn, ransomware, redirect, scam, torrent.

  5. Sentence segmentation: split documents into paragraphs based on line breaks, and then into sentences using the Mediacloud Sentence Splitter.

  6. Assign a unique ID to each sentence: for both Estonian and Lithuanian, each sentence is given a unique, randomly generated sentence ID. Sentence IDs are consistent within the Estonian or Lithuanian input files, but not across them; if an identical sentence occurs on both sides, it has a different ID in each language.

  7. Classify each sentence's language using the above-mentioned fastText language ID model.

  8. Compute statistics: count the number of times each sentence appears in the corpus and the number of documents that contain it.
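As an illustration of steps 3 and 7, the following is a minimal Python sketch of applying the 176-language fastText model; it assumes the publicly released lid.176.bin model file and is not the organizers' actual pipeline code.

import fasttext

model = fasttext.load_model("lid.176.bin")  # public 176-language LID model

def top_language(text: str) -> str:
    # Step 3 uses only the first 2,000 characters of a document;
    # fastText expects a single line, so strip newlines first.
    sample = text[:2000].replace("\n", " ")
    labels, probs = model.predict(sample, k=1)
    return labels[0].replace("__label__", "")  # e.g. "et", "lt", "en"

# Keep a document only if its 1-best language is Estonian or Lithuanian.
keep = top_language("Tere! See on näide eestikeelsest tekstist.") in {"et", "lt"}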

We provide input data in TSV format at both document and sentence level, which can be downloaded with the following script:

Update 2023-06-21

The initial version (2023-06-15) of the sentence files contained some duplicate rows with identical content and IDs but different 1-best language ID results. These duplicate rows have been removed. The document TSV has been updated to ensure consistency in the language ID column for each sentence.

get-data.bash
BASE="https://mtdataexternalpublic.blob.core.windows.net/2023datatask/2023-06-21"

wget $BASE/documents.et.tsv.gz
wget $BASE/documents.lt.tsv.gz

wget $BASE/sentences.et.tsv.gz
wget $BASE/sentences.lt.tsv.gz

The data format is described in the following subsections.

Documents

We provide the full, original web documents along with metadata; these files can serve as the starting point for any document-based data curation methods.

Each row in the TSV corresponds to a sentence in a document. Rows are sorted by URL, and each file contains a header with the following columns (a minimal reading sketch follows the list):

  1. Url: The original URL of the document

  2. Hostname: The hostname from the URL

  3. DocumentId: A unique identifier for each document (not used for submissions)

  4. ParagraphIdx: The index of the current paragraph in the document. This corresponds to splitting the text on line breaks.

  5. SentenceIdx: The index of each sentence within a paragraph. This corresponds to splitting the paragraph using the Mediacloud sentence splitter.

  6. Sentence: The text of a given sentence.

  7. SentenceId: A unique identifier for each sentence — submissions will consist of pairs of sentence ids.

  8. LangId: The result of running this sentence through the 176-language fastText language ID model.

  9. SentenceCount: The number of times this sentence occurs in the documents file.

  10. NumDocsContainingSentence: The number of unique documents containing this sentence in the documents file.
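A minimal reading sketch, assuming only the header described above; sentences may contain quote characters, so quoting is disabled.

import csv
import gzip

# Read the document-level TSV and inspect the first row's key fields.
with gzip.open("documents.et.tsv.gz", "rt", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row["Url"], row["ParagraphIdx"], row["SentenceIdx"], row["Sentence"])
        break  # inspect only the first row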

Sentences

We provide files which contain only the unique sentences from the web documents; these can serve as the starting point for any sentence-based data curation methods.

These TSV files are sorted by sentence ID and contain a header with the following columns (a filtering sketch follows the list):

  1. SentenceId: A unique identifier for each sentence — submissions will consist of pairs of sentence ids.

  2. Sentence: The text of a given sentence.

  3. LangId: The result of running this sentence through the 176-language fastText language ID model.

  4. SentenceCount: The number of times this sentence occurs in the documents file.

  5. NumDocsContainingSentence: The number of unique documents containing this sentence in the documents file.
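Mirroring the first step of the baseline described further below, a hedged sketch that loads only the sentences whose 1-best language ID matches the file's language:

import csv
import gzip

def load_sentences(path: str, lang: str):
    # Yield (SentenceId, Sentence) pairs whose LangId column matches `lang`.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if row["LangId"] == lang:
                yield row["SentenceId"], row["Sentence"]

et_sentences = dict(load_sentences("sentences.et.tsv.gz", "et"))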

Sentence Id List for Overlap Removal

We provide a plaintext list of sentence IDs that may overlap with the test and development datasets.

This file contains the union of IDs across both languages and will be used to filter data before training baseline systems.
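Participants who want to apply the same filtering in their own experiments could do so along these lines; the file name overlap_ids.txt is a placeholder, not the released file name.

# Drop candidate pairs that touch any held-out sentence ID.
with open("overlap_ids.txt", encoding="utf-8") as f:
    held_out = set(f.read().split())

def keep_pair(et_id: str, lt_id: str) -> bool:
    return et_id not in held_out and lt_id not in held_out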

Top-K Cosine Similarity

These files are an intermediate output from our baseline submission.

  • We filtered the input sentences based on the provided language ID column (e.g., keeping only Estonian sentences with an "et" language ID) and computed sentence embeddings with the LASER 2 model.

  • Using the FAISS library to index the embeddings for fast retrieval, we split each language's embeddings into several chunks (for parallelization), applied L2 normalization, and added them to a flat inner-product index, so that the resulting scores are equivalent to cosine similarity.

  • We queried each index with all L2-normalized embeddings from the other language and stored the top-8 results (locally, per chunk).

  • We aggregated and sorted the results across chunks, saving the top-8 results globally (a simplified sketch follows this list).
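A simplified, single-chunk sketch of this retrieval step (shown here in the "lt-et" direction: Lithuanian queries against an Estonian index), assuming the embeddings are already available as float32 numpy arrays; the real baseline shards the index across chunks.

import faiss
import numpy as np

et = np.load("laser2.et.part_0.npy").astype(np.float32)  # (n_et, 1024)
lt = np.load("laser2.lt.part_0.npy").astype(np.float32)  # (n_lt, 1024)

faiss.normalize_L2(et)  # after L2 normalization, inner product == cosine
faiss.normalize_L2(lt)
index = faiss.IndexFlatIP(et.shape[1])  # flat (exact) inner-product index
index.add(et)

# Top-8 Estonian results for every Lithuanian query sentence.
scores, result_rows = index.search(lt, 8)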

The results have been computed for both directions: "et-lt" (Estonian queries, Lithuanian results) and "lt-et" (the reverse). Files are in TSV format, with a header row and the following columns:

  1. QueryId: The sentence ID from the first language listed in the pair (e.g., "lt" for "lt-et"); this is the sentence used to query the FAISS index.

  2.-17. ResultIdK and ScoreK, for K = 1..8: ResultIdK is the sentence ID of the K-th ranked result from the second language listed in the pair (e.g., "et" for "lt-et"), and ScoreK is the cosine similarity between the LASER2 embedding of the query sentence (QueryId) and that of the K-th result.

Due to the large data size, the results are provided in multiple parts, split on the first character of the query's sentence ID (i.e., [0-9a-f]; 16 parts).

These files can be downloaded using the following wget script:

PREFIX="https://mtdataexternalpublic.blob.core.windows.net/2023datatask/2023-06-15/cosine_similarity/cosine_similarity"
for i in {0..9} {a..f}; do
    wget "$PREFIX.et-lt.part_$i.tsv.gz"
    wget "$PREFIX.lt-et.part_$i.tsv.gz"
done

Alternatively, the individual part files can be downloaded manually using the same URL pattern.

LASER 2 Embeddings

We provide LASER 2 embeddings as numpy arrays, one file per part. These files are parallel with the updated sentence-level TSV files: each part in the embedding filename corresponds to the first character of the sentence ID, so a part_0 file contains embeddings for the sentences whose IDs start with '0', sorted by sentence ID in alphabetical order. They can be downloaded using the following wget script:

PREFIX="https://mtdataexternalpublic.blob.core.windows.net/2023datatask/2023-06-29/laser2/laser2"
for i in {0..9} {a..f}; do
    wget $PREFIX.lt.part_$i.npy
    wget $PREFIX.et.part_$i.npy
done

Each part is about 15 GB in size. In addition to these large files, we also provide lower-dimensional (128d) embeddings in half precision (fp16), which are about 800 MB-1 GB per part. These files can be downloaded using the following wget script:

PREFIX="https://mtdataexternalpublic.blob.core.windows.net/2023datatask/2023-06-29/laser2-pca128_fp16/laser2.pca128_fp16"
for i in {0..9} {a..f}; do
    wget $PREFIX.lt.part_$i.npy
    wget $PREFIX.et.part_$i.npy
done
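Either variant can be loaded with numpy; a minimal sketch, using memory mapping to avoid reading a whole part into RAM:

import numpy as np

emb = np.load("laser2.pca128_fp16.et.part_0.npy", mmap_mode="r")
print(emb.shape, emb.dtype)           # expected: (num_sentences_in_part, 128), float16
vecs = emb[:1000].astype(np.float32)  # up-cast a slice before similarity search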

Baseline NMT

We share scripts to train and evaluate a baseline NMT system using Sockeye. Instructions are available at:

Submission

An example of the expected format for submissions is provided with the task materials. The file consists of two columns: Estonian SentenceIds and Lithuanian SentenceIds. In that example the sentence pairs were randomly matched with one another, so it is not expected that downstream MT quality will be high if a model is trained using these ID pairs.
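As a hedged end-to-end illustration, a submission file could be written along these lines; "matches" is an assumed iterable of (et_sentence_id, lt_sentence_id, cosine_score) tuples produced by an upstream mining step, and the threshold value is illustrative, not tuned.

import gzip

def write_submission(matches, path="awesome_constrained.et-lt.tsv.gz",
                     threshold=0.9):  # illustrative threshold, not tuned
    # One tab-separated (Estonian ID, Lithuanian ID) pair per line.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for et_id, lt_id, score in matches:
            if score >= threshold:
                out.write(f"{et_id}\t{lt_id}\n")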

Update

A small clarification about formatting, based on a participant question about many-to-many, one-to-many, and many-to-one alignments: if you have such alignments, please create a comma-separated list on each side of the TSV, and we will join the listed sentence(s) on either side with a single space to create the training data. Example (an expansion sketch follows it):

SENTID2    SENTID7
SENTID33,SENTID24    SENTID13,SENTID25,SENTID235
SENTID28    SENTID233
SENTID21,SENTID52    SENTID222
SENTID332    SENTID144,SENTID288
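A sketch of how such a row would be expanded per the rule above (joining the sentences on each side with a single space); et_id2sent and lt_id2sent are assumed mappings from SentenceId to sentence text:

def expand(line: str, et_id2sent: dict, lt_id2sent: dict):
    et_ids, lt_ids = line.rstrip("\n").split("\t")
    src = " ".join(et_id2sent[i] for i in et_ids.split(","))
    tgt = " ".join(lt_id2sent[i] for i in lt_ids.split(","))
    return src, tgt  # one (source, target) training pair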

Your final submission files should be uploaded here: https://drive.google.com/drive/folders/1v5fPO-gdBVU9t3YagKaUkD9H9SrwUs34 . If you are not able to access Google Drive, please contact the organizers' mailing list as soon as possible (wmt-data-task-organizers@googlegroups.com).

Please make sure to include an identifier in the name of your submission (e.g., the name of your university or lab). Please also include "constrained" or "unconstrained" in the name to indicate the type of system; systems are unconstrained if they used data that was not explicitly allowed for the task. For example, someone at Awesome University who used only allowed data might name their file awesome_constrained.et-lt.tsv.gz, while someone at Cool Company submitting an unconstrained system might name their file cool_unconstrained.et-lt.tsv.gz.

System descriptions are due on September 22, and the camera-ready deadline is October 9. System descriptions are not required to be anonymized. Please include a footnote in your system description indicating the name of your submission(s); this will allow us to properly cite you in the task overview paper.

Organizers

  • Tobias Domhan (Amazon)

  • Thamme Gowda (Microsoft)

  • Huda Khayrallah (Microsoft)

  • Philipp Koehn (Johns Hopkins University)

  • Steve Sloto (Microsoft)

  • Brian Thompson (Amazon)

To reach the organizers, please email: wmt-data-task-organizers@googlegroups.com
To get updates about the shared task, please join this mailing list: https://groups.google.com/g/wmt-data-task/