Sentence Alignment

Translated texts are often found in the form of translated documents or web pages. Since sentences are not always mapped one-to-one, sentence alignment methods are needed.

Sentence Alignment is the main subject of 46 publications. 28 are discussed here.

Publications

Sentence alignment was a very active field of research in the early days of statistical machine translation. An influential early method is based on sentence length, measured in words (Brown et al., 1991; Gale and Church, 1991; Gale and Church, 1993) or characters (Church, 1993). Other methods use alignment chains (Melamed, 1996; Melamed, 1999), model omissions (Melamed, 1996), distinguish between large-scale segmentation of text and detailed sentence alignment (Simard and Plamondon, 1996), or apply line detection methods from image processing to detect large-scale alignment patterns (Chang and Chen, 1997; Melamed, 1997).
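To make the length-based idea concrete, the following is a minimal Python sketch of dynamic-programming alignment driven only by sentence lengths. It is a simplified illustration and not the probabilistic model of any of the cited papers: the allowed bead shapes, the penalty values in BEADS, and the relative-difference cost are assumptions chosen for readability, and the names bead_cost and align_by_length are hypothetical.

```python
from typing import List, Tuple

# Allowed bead shapes (number of source sentences, number of target sentences)
# mapped to a fixed penalty. The values are illustrative assumptions only.
BEADS = {(1, 1): 0.0, (2, 1): 0.1, (1, 2): 0.1, (1, 0): 1.0, (0, 1): 1.0}

Span = Tuple[int, int]


def bead_cost(src_len: int, tgt_len: int, penalty: float) -> float:
    """Penalty plus the relative length difference (a toy cost, assumed here)."""
    if src_len == 0 or tgt_len == 0:
        return penalty  # unaligned sentences pay the bead penalty only
    return penalty + abs(src_len - tgt_len) / max(src_len, tgt_len)


def align_by_length(src_lens: List[int], tgt_lens: List[int]) -> List[Tuple[Span, Span]]:
    """Return beads as ((src_start, src_end), (tgt_start, tgt_end)) half-open spans."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), penalty)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For source lengths [13, 52] and target lengths [13, 27, 20], this sketch pairs the first sentences one-to-one and aligns the long second source sentence with the two remaining target sentences, which illustrates why alignments that are not one-to-one must be allowed.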
Kay and Röscheisen (1993) propose an iterative algorithm that uses spelling similarity and word co-occurrences to drive sentence alignment. Several researchers proposed including lexical information (Chen, 1993; Dagan et al., 1993; Utsuro et al., 1994; Wu, 1994; Haruno and Yamazaki, 1996; Chuang and Chang, 2002; Kueng and Su, 2002; Moore, 2002; Nightingale and Tanaka, 2003; Aswani and Gaizauskas, 2005), content words (Papageorgiou et al., 1994), and numbers and n-grams (Davis et al., 1995). Sentence alignment may also be improved by using a third language in multilingual corpora (Simard, 1999). More effort is needed to align very noisy corpora (Zhao et al., 2003). Different sentence alignment methods are compared by Singh and Husain (2005). Xu et al. (2006) propose a method that iteratively performs binary splits of a document to obtain a sentence alignment. Enright and Kondrak (2007) use a simple and fast method for document alignment that relies on the overlap of rare but identically spelled words, which are mostly cognates, names, and numbers.
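The rare-word overlap idea for document alignment can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the published method: here "rare" is taken to mean that a word occurs at most once within a document, tokenization is a plain regular expression, and the function names rare_words and match_documents are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set


def rare_words(text: str, max_count: int = 1) -> Set[str]:
    """Lowercased tokens that occur at most max_count times in the text (assumed notion of 'rare')."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {word for word, count in counts.items() if count <= max_count}


def match_documents(src_docs: Dict[str, str], tgt_docs: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pair each source document with the target document sharing the most rare words."""
    tgt_rare = {name: rare_words(text) for name, text in tgt_docs.items()}
    matches = {}
    for name, text in src_docs.items():
        rare = rare_words(text)
        # Identical spelling is required, so a plain set intersection is enough.
        matches[name] = max(
            tgt_rare,
            key=lambda tgt_name: len(rare & tgt_rare[tgt_name]),
            default=None,  # no target documents at all
        )
    return matches


english = {
    "en1": "Kondrak visited Edmonton in 2007 with Enright",
    "en2": "The Louvre in Paris opened in 1793",
}
french = {
    "fr1": "Le Louvre à Paris a ouvert en 1793",
    "fr2": "Kondrak a visité Edmonton en 2007 avec Enright",
}
pairs = match_documents(english, french)
```

In this toy example the shared tokens are exactly the kinds of words the text mentions: names such as "Kondrak" and "Louvre" and numbers such as "2007" and "1793". A real system would also need to handle ties and documents with no overlap, which this sketch does not attempt.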

Benchmarks

Discussion

Related Topics

Extracting parallel sentences from comparable corpora is a much harder challenge.

New Publications

  • UNKNOWN CITATION 'Wołk2014'
  • Mújdricza-Maydt et al. (2013)
  • Quan et al. (2013)
  • Kutuzov (2013)
  • Krstovski and Smith (2013)
  • Zaidan and Chowdhary (2013)
  • Plamada and Volk (2013)
  • Zhang et al. (2013)
  • Lamraoui and Langlais (2013)
  • Stymne et al. (2013)
  • Stymne et al. (2013)
  • Sennrich and Volk (2010)
  • Shi and Zhou (2008)
  • Mamitimin and Hou (2009)
  • Braune and Fraser (2010)
  • Li et al. (2010)
  • Slayden et al. (2010)
  • Vilar (2005)
  • Chuang et al. (2004)
  • Palmer and Hearst (1997)
