Bilingual Corpus Acquisition from the Web

This project is based on Bitextor, a free and open-source application whose objective is to generate translation memories (TMs) from multilingual websites. It downloads all the HTML files on a given website and applies a set of heuristics (mainly based on the HTML structure and on text block length) to pair files that are candidates to form a bitext; TMs in TMX format are then generated from these candidate pairs. It is licensed under GNU GPL v2.1.

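To make the expected output concrete, here is a minimal sketch (not Bitextor's actual code) of how a one-unit TMX file could be written with Python's standard library; the creationtool value and the example segments are invented for illustration:

    import xml.etree.ElementTree as ET

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"   # serialised as xml:lang

    def write_minimal_tmx(path, src_lang, trg_lang, pairs):
        """Write a minimal TMX 1.4 file with one <tu> per aligned segment pair."""
        tmx = ET.Element("tmx", version="1.4")
        ET.SubElement(tmx, "header", {
            "creationtool": "bitextor-sketch", "creationtoolversion": "0.1",
            "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
            "srclang": src_lang, "datatype": "plaintext"})
        body = ET.SubElement(tmx, "body")
        for src_seg, trg_seg in pairs:
            tu = ET.SubElement(body, "tu")
            for lang, text in ((src_lang, src_seg), (trg_lang, trg_seg)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
        ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)

    write_minimal_tmx("example.tmx", "en", "ca",
                      [("Welcome to our website.", "Benvinguts al nostre lloc web.")])
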
  1. Evaluating Bitextor's results and building one or more development sets to try out different heuristics, and changing how text blocks are compared: compare the number of words in both text blocks; currently, only the number of characters is compared. The idea is to add an option to the configuration file so that the length of the text blocks can be compared in characters, in words, or in a combination of both (a sketch of such a comparison is given after this list).
  2. Bitextor only considers two text blocks for comparison if the difference between their lengths is below a given (language-pair-dependent) percentage threshold. It could be interesting to add another threshold setting the minimum length difference for two text blocks to be considered for comparison. For English-Catalan we could set, for example, min_length_difference=10 and max_length_difference=30 (see the same sketch after this list).
  3. Using machine translation (MT), perhaps in conjunction with a lemmatiser or a part-of-speech tagger, to compare two texts in order to guess whether they form a bitext or not (a new heuristic to find bitexts). The idea is to translate one of the texts (left) into the language of the other one (right) and design a method to compare them (perhaps simplifying both texts before the comparison); a rough sketch is given after this list.
  4. Applying the previous approach to evaluating the quality of the translation units (TUs) of a given TM. The idea here is not to compare the texts, but the TUs obtained after processing a bitext. The objective is to design a system that estimates how probable it is that the TM is correct and which TUs are likely to be incorrect. This information may be used to guide the user in correcting the resulting TM (the same sketch after this list includes a TU-scoring example).
  5. Adding a preprocessing step to clean irrelevant content (menus, titles, footer copyright/license information, etc.) from a group of files with a very similar HTML structure. In the literature we can find previous work on cleaning irrelevant information in web pages [1] that could be integrated into Bitextor; a simple cleaning sketch is given after this list.
  6. Finding a substitute for LibEnca (character encoding detector): it seems that LibEnca sometimes has problems detecting the character encoding correctly. It would be interesting to find a better option for this (maybe libmagic?); a quick detection sketch is given after this list.
  7. Trying to combine Bitextor with OmegaT beyond the obvious usage of the XML files.
  8. Modifying LibTagAligner (the library used by Bitextor to align a bitext using the HTML structure and to generate a TM) so that it is able to decide when to join two or more different sentences when aligning them, as done by the algorithm of Gale and Church (1993); a simplified sketch of that length-based decision is given after this list.
  9. Installing Bitextor on Windows: Bitextor has been created and designed to run on Unix-like systems, so the idea would be to try to compile, install and run Bitextor on MS Windows.
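
For ideas 1 and 2, here is a minimal sketch of what a configurable length comparison could look like; the mode values (chars, words, mixed) and the option names min_length_difference and max_length_difference are illustrative, not Bitextor's actual configuration keys:

    def length(text, mode="chars"):
        """Length of a text block in characters, words, or a simple mix of both."""
        chars = len(text)
        words = len(text.split())
        if mode == "chars":
            return chars
        if mode == "words":
            return words
        # "mixed": one possible combination; other weightings could be tried
        return (chars + words) / 2.0

    def comparable(block_a, block_b, mode="chars",
                   min_length_difference=10, max_length_difference=30):
        """Decide whether two text blocks should be compared at all, based on the
        percentage difference between their lengths (idea 2: both a minimum and a
        maximum threshold)."""
        la, lb = length(block_a, mode), length(block_b, mode)
        if max(la, lb) == 0:
            return False
        diff_pct = 100.0 * abs(la - lb) / max(la, lb)
        return min_length_difference <= diff_pct <= max_length_difference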
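
For ideas 3 and 4, a rough sketch of the MT-based comparison and of its reuse for TU scoring; translate_to and lemmatise are placeholders for whatever MT engine and lemmatiser/PoS tagger would be plugged in, and the lemma-overlap score is just one possible comparison method:

    def mt_similarity(left_text, right_text, translate_to, lemmatise):
        """Translate the left text into the right text's language and compare the
        two sides as bags of lemmas (Jaccard overlap). `translate_to` and
        `lemmatise` are placeholders for external tools."""
        translated = translate_to(left_text)          # left language -> right language
        left_lemmas = set(lemmatise(translated))      # simplify both texts
        right_lemmas = set(lemmatise(right_text))
        if not left_lemmas or not right_lemmas:
            return 0.0
        return len(left_lemmas & right_lemmas) / len(left_lemmas | right_lemmas)

    def rank_suspect_tus(tus, translate_to, lemmatise, threshold=0.3):
        """Idea 4: score every (source, target) TU of a TM and return those whose
        similarity falls below a threshold, so the user can review them first."""
        scored = [(mt_similarity(src, trg, translate_to, lemmatise), src, trg)
                  for src, trg in tus]
        return sorted(t for t in scored if t[0] < threshold)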
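
For idea 5, one simple option (not the method of Yi et al. [1], just an illustration) is to drop text blocks that repeat verbatim across many pages of the same website, since menus, headers and footers tend to be identical in structurally similar files:

    from collections import Counter

    def remove_repeated_blocks(pages, max_page_fraction=0.5):
        """pages: list of lists of text blocks, one list per downloaded HTML file.
        Blocks that appear (verbatim) in more than `max_page_fraction` of the
        pages are assumed to be boilerplate (menus, footers, ...) and removed."""
        occurrences = Counter()
        for blocks in pages:
            occurrences.update(set(blocks))       # count each block once per page
        limit = max_page_fraction * len(pages)
        return [[b for b in blocks if occurrences[b] <= limit] for blocks in pages]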
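
For idea 6, candidate detectors could be probed quickly from Python; the sketch below uses the chardet library as an example alternative (libmagic could be tried in the same way through its Python bindings), and the 0.5 confidence cut-off is arbitrary:

    import chardet

    def detect_encoding(path, fallback="utf-8"):
        """Guess the character encoding of a downloaded HTML file with chardet,
        falling back to a default when the detector is not confident enough."""
        with open(path, "rb") as f:
            raw = f.read()
        guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
        if guess["encoding"] and guess["confidence"] > 0.5:
            return guess["encoding"]
        return fallback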
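
For idea 8, the core of the Gale and Church (1993) approach is a length-based cost that lets the aligner choose between keeping sentences separate and joining them (2-1 or 1-2 alignments). The sketch below is heavily simplified: it only shows the local cost and one join-or-split decision, whereas the real algorithm applies dynamic programming over the whole bitext:

    import math

    # Parameters reported by Gale & Church (1993): expected target/source character
    # ratio C, its variance S2, and prior probabilities of the alignment types.
    C, S2 = 1.0, 6.8
    PRIOR = {"1-1": 0.89, "1-0": 0.0099, "0-1": 0.0099, "2-1": 0.089, "1-2": 0.089}

    def match_cost(len_src, len_trg, kind):
        """Negative log-probability of aligning a source chunk of len_src characters
        with a target chunk of len_trg characters as an alignment of type `kind`."""
        if len_src == 0 and len_trg == 0:
            return 0.0
        mean = (len_src + len_trg / C) / 2.0
        delta = (len_trg - len_src * C) / math.sqrt(S2 * mean)
        # two-sided tail probability of a standard normal variable
        p_delta = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
        p_delta = max(p_delta, 1e-10)              # avoid log(0)
        return -math.log(PRIOR[kind] * p_delta)

    # Joining two short source sentences against one long target sentence (2-1)
    # versus aligning only the first one (1-1) and dropping the second (1-0):
    cost_join = match_cost(40 + 45, 90, "2-1")
    cost_split = match_cost(40, 90, "1-1") + match_cost(45, 0, "1-0")
    print("join" if cost_join < cost_split else "split")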

[1] Yi, L., Liu, B. and Li, X. (2003). Eliminating noisy information in web pages for data mining. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 296-305.
