Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles

Christian Hardmeier
Fondazione Bruno Kessler
Human Language Technologies
Via Sommarive, 18
I-38050 Povo (Trento)
hardmeier@fbk.eu

Martin Volk
Universität Zürich
Institut für Computerlinguistik
Binzmühlestrasse 14
CH-8050 Zürich
volk@cl.uzh.ch
Abstract

Statistical Machine Translation (SMT) has been successfully employed to support translation of film subtitles. We explore the integration of Constraint Grammar corpus annotations into a Swedish–Danish subtitle SMT system in the framework of factored SMT. While the usefulness of the annotations is limited with large amounts of parallel data, we show that linguistic annotations can increase the gains in translation quality when monolingual data in the target language is added to an SMT system based on a small parallel corpus.

1 Introduction

In countries where foreign-language films and series on television are routinely subtitled rather than dubbed, there is a considerable demand for efficiently produced subtitle translations. Although superficially it may seem that subtitles are not appropriate for automatic processing as a result of their literary character, it turns out that their typical text structure, characterised by brevity and syntactic simplicity, and the immense text volumes processed daily by specialised subtitling companies make it possible to produce raw translations of film subtitles with statistical methods quite effectively. If these raw translations are subsequently post-edited by skilled staff, production quality translations can be obtained with considerably less effort than if the subtitles were translated by human translators with no computer assistance.

A successful subtitle Machine Translation system for the language pair Swedish–Danish, which has now entered into productive use, has been presented by Volk and Harder (2007). The goal of the present study is to explore whether and how the quality of a Statistical Machine Translation (SMT) system for film subtitles can be improved by using linguistic annotations. To this end, a subset of 1 million subtitles of the training corpus used by Volk and Harder was morphologically annotated with the DanGram parser (Bick, 2001). We integrated the annotations into the translation process using the methods of factored Statistical Machine Translation (Koehn and Hoang, 2007) implemented in the widely used Moses software. After describing the corpus data and giving a short overview of the methods used, we present a number of experiments comparing different factored SMT setups. The experiments are then replicated with reduced training corpora which contain only part of the available training data. These series of experiments provide insights about the impact of corpus size on the effectiveness of using linguistic abstractions for SMT.

2 Machine translation of subtitles

As a text genre, subtitles play a curious role in a complex environment of different media and modalities. They depend on the medium film, which combines a visual channel with an auditive component composed of spoken language and non-linguistic elements such as noise or music. Within this framework, they render the spoken dialogue into written text, are blended in with the visual channel and displayed simultaneously as the original sound track is played back, which redundantly contains the same information in a form that may or may not be accessible to the viewer. In their linguistic form, subtitles should be faithful, both in contents and in style, to the film dialogue which they represent. This means in particular that they usually try to convey an impression of orality. On the other hand, they are constrained by the mode of their presentation: short, written captions superimposed on the picture frame.

According to Becquemont (1996), the characteristics of subtitles are governed by the interplay of two conflicting principles: unobtrusiveness (discrétion) and readability (lisibilité).
In order to provide a satisfactory experience to the viewers, it is paramount that the subtitles help them quickly understand the meaning of the dialogue without distracting them from enjoying the film. The amount of text that can be displayed at one time is limited by the area of the screen that may be covered by subtitles (usually no more than two lines) and by the minimum time the subtitle must remain on screen to ensure that it can actually be read. As a result, the subtitle text must be shortened with respect to the full dialogue text in the actors' script. The extent of the reduction depends on the script and on the exact limitations imposed for a specific subtitling task, but may amount to as much as 30 % and reach 50 % in extreme cases (Tomaszkiewicz, 1993, 6).

As a result of this processing and the considerations underlying it, subtitles have a number of properties that make them especially well suited for Statistical Machine Translation. Owing to their presentational constraints, they mainly consist of comparatively short and simple phrases. Current SMT systems, when trained on a sufficient amount of data, have reliable ways of handling word translation and local structure. By contrast, they are still fairly weak at modelling long-range dependencies and reordering. Compared to other text genres, this weakness is less of an issue in the Statistical Machine Translation of subtitles thanks to their brevity and simple structure. Indeed, half of the subtitles in the Swedish part of our parallel training corpus are no more than 11 tokens long, including two tokens to mark the beginning and the end of the segment and counting every punctuation mark as a separate token. A considerable number of subtitles only contains one or two words, besides punctuation, often consisting entirely of a few words of affirmation, negation or abuse. These subtitles can easily be translated by an SMT system that has seen similar examples before.

The orientation of the genre towards spoken language also has some disadvantages for Machine Translation systems. It is possible that the language of the subtitles, influenced by characteristics of speech, contains unexpected features such as stutterings, word repetitions or renderings of non-standard pronunciations that confuse the system. Such features are occasionally employed by subtitlers to lend additional colour to the text, but as they are in stark conflict with the ideals of unobtrusiveness and readability, they are not very frequent.

It is worth noting that, unlike rule-based Machine Translation systems, a statistical system does not in general have any difficulties translating ungrammatical or fragmentary input: phrase-based SMT, operating entirely on the level of words and word sequences, does not require the input to be amenable to any particular kind of linguistic analysis such as parsing. Whilst this approach makes it difficult to handle some linguistic challenges such as long-distance dependencies, it has the advantage of making the system more robust to unexpected input, which is more important for subtitles.

We have only been able to sketch the characteristics of the subtitle text genre in this paper. Díaz-Cintas and Remael (2007) provide a detailed introduction, including the linguistics of subtitling and translation issues, and Pedersen (2007) discusses the peculiarities of subtitling in Scandinavia.

3 Constraint Grammar annotations

To explore the potential of linguistically annotated data, our complete subtitle corpus, both in Danish and in Swedish, was linguistically analysed with the DanGram Constraint Grammar (CG) parser (Bick, 2001), a system originally developed for the analysis of Danish for which there is also a Swedish grammar. Constraint Grammar (Karlsson, 1990) is a formalism for natural language parsing. Conceptually, a CG parser first produces possible analyses for each word by considering its morphological features and then applies constraining rules to filter out analyses that do not fit into the context. Thus, the word forms are gradually disambiguated, until only one analysis remains; multiple analyses may be retained if the sentence is ambiguous.

The annotations produced by the DanGram parser were output as tags attached to individual words as in the following example:
$-
Vad [vad] <interr> INDP NEU S NOM @ACC>
vet [veta] V PR AKT @FS-QUE
du [du] PERS 2S UTR S NOM @<SUBJ
om [om] PRP @<ADVL
det [den] PERS NEU 3S ACC @P<
$?
In addition to the word forms and the accompanying lemmas (in square brackets), the annotations contained part-of-speech (POS) tags, such as INDP for 'independent pronoun' or V for 'verb', a morphological analysis for each word (such as NEU S NOM for 'neuter singular nominative') and a tag specifying the syntactic function of the word in the sentence (such as @ACC>, indicating that the sentence-initial pronoun is an accusative object of the following verb). For some words, more fine-grained part-of-speech information was specified in angle brackets, such as <interr> for 'interrogative pronoun' or a tag marking a 'verb of movement'. In our experiments, we used word forms, lemmas, POS tags and morphological analyses. The fine-grained POS tags and the syntax tags were not used.
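To make the factored representation concrete, the sketch below shows how a single CG output line of the form illustrated above could be turned into a Moses-style factored token in which the factors are separated by vertical bars. It is merely illustrative and is not the conversion script used to prepare the corpus; the treatment of the tag classes is a simplification of the DanGram output format.

import re

def cg_line_to_factors(line):
    """Turn one CG line ("word [lemma] TAGS") into word|lemma|pos|morph.

    Simplified sketch: secondary tags in angle brackets and syntactic
    @-tags are discarded, mirroring the factors used in the experiments;
    the first remaining tag is taken to be the POS tag."""
    word, lemma, rest = re.match(r'(\S+) \[([^\]]+)\]\s*(.*)', line).groups()
    tags = [t for t in rest.split()
            if not t.startswith('@') and not t.startswith('<')]
    pos, morph = tags[0], '_'.join(tags[1:]) or '_'
    return '|'.join((word, lemma, pos, morph))

print(cg_line_to_factors('vet [veta] V PR AKT @FS-QUE'))   # vet|veta|V|PR_AKT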
4 Factored Statistical Machine Translation

Statistical Machine Translation formalises the translation process by modelling the probabilities of target language (TL) output strings T given a source language (SL) input string S, p(T|S), and conducting a search for the output string T̂ with the highest probability. In the Moses decoder (Koehn et al., 2007), which we used in our experiments, this probability is decomposed into a log-linear combination of a number of feature functions h_i(S, T), which map a pair of a source and a target language element to a score based on different submodels such as translation models or language models. Each feature function is associated with a weight λ_i that specifies its contribution to the overall score:

\hat{T} = \arg\max_T \log p(T \mid S) = \arg\max_T \sum_i \lambda_i h_i(S, T)
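As an illustration of this search criterion, the following minimal sketch selects, from a handful of complete candidate translations, the one that maximises the weighted sum of feature function scores. The feature scores and weights are invented toy values, not values from the actual models.

def best_translation(candidates, weights):
    """Return the candidate T maximising sum_i lambda_i * h_i(S, T).

    `candidates` maps each candidate string to its tuple of feature
    scores h_i(S, T), e.g. log probabilities from the translation and
    language models; `weights` holds the corresponding lambda_i."""
    return max(candidates,
               key=lambda t: sum(w * h for w, h in zip(weights, candidates[t])))

# Toy example: two hypothetical Danish hypotheses with invented
# (log p_TM, log p_LM) scores and weights (lambda_TM, lambda_LM).
hypotheses = {
    'Hvad ved du om det ?': (-1.2, -4.0),
    'Hvad du ved om det ?': (-1.0, -6.5),
}
print(best_translation(hypotheses, weights=(1.0, 0.6)))   # Hvad ved du om det ?

In the real decoder, of course, the maximisation is a search over an exponential space of derivations rather than an enumeration of complete candidates.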
The translation models employed in factored SMT are phrase-based. The phrases included in a translation model are extracted from a word-aligned parallel corpus with the techniques described by Koehn et al. (2003). The associated probabilities are estimated by the relative frequencies of the extracted phrase pairs in the same corpus. For language modelling, we used the SRILM toolkit (Stolcke, 2002); unless otherwise specified, 6-gram language models with modified Kneser-Ney smoothing were used.

The SMT decoder tries to translate the words and phrases of the source language sentence in the order in which they occur in the input. If the target language requires a different word order, reordering is possible at the cost of a score penalty. The translation model has no notion of sequence, so it cannot control reordering. The language model can, but it has no access to the source language text, so it considers word order only from the point of view of TL grammaticality and cannot model systematic differences in word order between two languages. Lexical reordering models (Koehn et al., 2005) address this issue in a more explicit way by modelling the probability of certain changes in word order, such as swapping words, conditioned on the source and target language phrase pair that is being processed.
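The idea behind such a model can be sketched as follows: for each phrase pair extracted from the word-aligned corpus, one records whether the neighbouring phrase was translated monotonically, swapped, or placed discontinuously, and turns the counts into conditional probabilities. The snippet below is a schematic illustration only, with simplified orientation labels, a simple smoothing constant and toy counts; it is not the training procedure implemented in Moses.

from collections import Counter, defaultdict

def reordering_table(oriented_phrase_pairs, smoothing=0.5):
    """Estimate p(orientation | source phrase, target phrase) from
    (source_phrase, target_phrase, orientation) tuples, where the
    orientation is 'monotone', 'swap' or 'discontinuous'."""
    counts = defaultdict(Counter)
    for src, tgt, orientation in oriented_phrase_pairs:
        counts[(src, tgt)][orientation] += 1
    orientations = ('monotone', 'swap', 'discontinuous')
    table = {}
    for pair, c in counts.items():
        total = sum(c.values()) + smoothing * len(orientations)
        table[pair] = {o: (c[o] + smoothing) / total for o in orientations}
    return table

# Toy counts for a single Swedish-Danish phrase pair (invented numbers).
table = reordering_table([('mitt emot', 'over for', 'monotone'),
                          ('mitt emot', 'over for', 'monotone'),
                          ('mitt emot', 'over for', 'swap')])
print(table[('mitt emot', 'over for')])

Conditioning the same counts on lemmas or POS tags instead of word forms simply means replacing the phrase strings by the corresponding factor sequences, which pools the counts over more occurrences.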
In its basic form, Statistical Machine Translation treats word tokens as atomic and does not permit further decomposition or access to single features of the words. Factored SMT (Koehn and Hoang, 2007) extends this model by representing words as vectors composed of a number of features and makes it possible to integrate word-level annotations such as those produced by a Constraint Grammar parser into the translation process. The individual components of the feature vectors are called factors. In order to map between different factors on the target language side, the Moses decoder works with generation models, which are implemented as dictionaries and extracted from the target-language side of the training corpus. They can be used, e.g., to generate word forms from lemmas and morphology tags, or to transform word forms into part-of-speech tags, which could then be checked using a language model.
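A generation model of this kind is conceptually a conditional dictionary. The sketch below, an illustration rather than the Moses implementation, estimates p(word form | lemma, morphology) by relative frequency from a factored target-language corpus; the morphology tags in the toy data are invented.

from collections import Counter, defaultdict

def train_generation_model(factored_sentences):
    """Estimate p(word form | lemma, morphology) from sentences given as
    lists of (word, lemma, morph) triples."""
    counts = defaultdict(Counter)
    for sentence in factored_sentences:
        for word, lemma, morph in sentence:
            counts[(lemma, morph)][word] += 1
    return {key: {word: n / sum(c.values()) for word, n in c.items()}
            for key, c in counts.items()}

# Hypothetical Danish triples with made-up morphology tags.
model = train_generation_model([
    [('kontrakter', 'kontrakt', 'P_IDF_NOM'), ('kontrakt', 'kontrakt', 'S_IDF_NOM')],
])
print(model[('kontrakt', 'P_IDF_NOM')])   # {'kontrakter': 1.0}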
5 Experiments with the full corpus

We ran three series of experiments to study the effects of different SMT system setups on translation quality with three different configurations of training corpus sizes. For each condition, several Statistical Machine Translation systems were trained and evaluated.

In the full data condition, the complete system was trained on a parallel corpus of some 900,000 subtitles with source language Swedish and target language Danish, corresponding to around 10 million tokens in each language. The feature weights were optimised using minimum error rate training (Och, 2003) on a development set of 1,000 subtitles that had not been used for training, then the system was evaluated on a 10,000 subtitle test set that had been held out during the whole development phase.
The translations were evaluated with the widely used BLEU and NIST scores (Papineni et al., 2002; Doddington, 2002). The outcomes of different experiments were compared with a randomisation-based hypothesis test (Cohen, 1995, 165–177). The test was two-sided, and the confidence level was fixed at 95 %.
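One way to realise such a test is approximate randomisation: the outputs of the two systems are randomly swapped segment by segment, the corpus-level metric difference is recomputed, and the p-value is the proportion of shuffles whose difference is at least as large as the observed one. The code below is a minimal sketch of this general approach, not the evaluation scripts used in the experiments; metric stands for any corpus-level scorer such as BLEU.

import random

def randomisation_test(out_a, out_b, refs, metric, trials=10000, seed=0):
    """Two-sided approximate randomisation test for the difference of a
    corpus-level metric between two systems.

    out_a, out_b: segment-level outputs of systems A and B;
    refs: the reference translations; metric(outputs, refs) -> float."""
    rng = random.Random(seed)
    observed = abs(metric(out_a, refs) - metric(out_b, refs))
    hits = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(out_a, out_b):
            if rng.random() < 0.5:      # swap the two systems' outputs for this segment
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a, refs) - metric(shuffled_b, refs)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)    # p-value; reject at the 95 % level if below 0.05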
The results of the experiments can be found in table 1. The baseline system used only a translation model operating on word forms and a 6-gram language model on word forms. This is a standard setup for an unfactored SMT system. Two systems additionally included a 6-gram language model operating on part-of-speech tags and a 5-gram language model operating on morphology tags, respectively. The annotation factors required by these language models were produced from the word forms by suitable generation models.

In the full data condition, both the part-of-speech and the morphology language model brought a slight, but statistically significant gain in terms of BLEU scores, which indicates that abstract information about grammar can in some cases help the SMT system choose the right words. The improvement is small; indeed, it is not reflected in the NIST scores, but some beneficial effects of the additional language models can be observed in the individual output sentences.

One thing that can be achieved by taking word class information into account is the disambiguation of ambiguous word forms. Consider the following example:

Input: Ingen vill bo mitt emot en ismaskin.
Reference: Ingen vil bo lige over for en ismaskine.
Baseline: Ingen vil bo mit imod en ismaskin.
POS/Morphology: Ingen vil bo over for en ismaskin.

Since the word ismaskin 'ice machine' does not occur in the Swedish part of the training corpus, none of the SMT systems was able to translate it. All of them copied the Swedish input word literally to the output, which is a mistake that cannot be fixed by a language model. However, there is a clear difference in the translation of the phrase mitt emot 'opposite'. For some reason, the baseline system chose to translate the two words separately and mistakenly interpreted the adverb mitt, which is part of the Swedish expression, as the homonymous first person neuter possessive pronoun 'my', translating the Swedish phrase as ungrammatical Danish mit imod 'my against'. Both of the additional language models helped to rule out this error and correctly translate mitt emot as over for, yielding a much better translation. Neither of them output the adverb lige 'just' found in the reference translation, for which there is no explicit equivalent in the input sentence.

In the next example, the POS and the morphology language model produced different output:

Input: Dåliga kontrakt, dålig ledning, dåliga agenter.
Reference: Dårlige kontrakter, dårlig styring, dårlige agenter.
Baseline: Dårlige kontrakt, dårlig forbindelse, dårlige agenter.
POS: Dårlige kontrakt, dårlig ledelse, dårlige agenter.
Morphology: Dårlige kontrakter, dårlig forbindelse, dårlige agenter.

In Swedish, the indefinite singular and plural forms of the word kontrakt 'contract(s)' are homonymous. The two SMT systems without support for morphological analysis incorrectly produced the singular form of the noun in Danish. The morphology language model recognised that the plural adjective dårlige 'bad' is more likely to be followed by a plural noun and preferred the correct Danish plural form kontrakter 'contracts'. The different translations of the word ledning as 'management' or 'connection' can be pinned down to a subtle influence of the generation model probability estimates. They illustrate how sensitive the system output is in the face of true ambiguity. None of the systems presented here has the capability of reliably choosing the right word based on the context in this case.

In three experiments, the baseline configuration was extended by adding lexical reordering models conditioned on word forms, lemmas and part-of-speech tags, respectively. As in the language model experiments, the required annotation factors on the TL side were produced by generation models.

The lexical reordering models turn out to be useful in the full data experiments only when conditioned on word forms. When conditioned on lemmas, the score is not significantly different from the baseline score, and when conditioned on part-of-speech tags, it is significantly lower. In this case, the most valuable information for lexical reordering lies in the word form itself. Lemma and part of speech are obviously not the right abstractions to model the reordering processes when sufficient data is available.
Table 1: Experimental results

                            full data           symmetric           asymmetric
                          BLEU      NIST      BLEU      NIST      BLEU      NIST
Baseline                  53.67 %   8.18      42.12 %   6.83      44.85 %   7.10
Language models
  parts of speech       ↑ 53.90 %   8.17      42.59 %   6.87    ↓ 44.71 %   7.08
  morphology            ↑ 54.07 %   8.18      42.86 %   6.92      44.95 %   7.09
Lexical reordering
  word forms            ↑ 53.99 %   8.21      42.13 %   6.83    ↓ 44.72 %   7.05
  lemmas                  53.59 %   8.15      42.30 %   6.86    ↓ 44.71 %   7.06
  parts of speech       ↓ 53.36 %   8.13      42.33 %   6.86    ↓ 44.63 %   7.05
Analytical translation    53.73 %   8.18      42.28 %   6.90    ↑ 46.73 %   7.34

↑ BLEU score significantly above baseline (p < .05)
↓ BLEU score significantly below baseline (p < .05)
Another system, which we call the analytical translation system, was modelled on suggestions by Koehn and Hoang (2007) and Bojar (2007). It used the lemmas and the output of the morphological analysis to decompose the translation process and use separate components to handle the transfer of lexical and grammatical information. In order to achieve this, the baseline system was extended with additional translation tables mapping SL lemmas to TL lemmas and SL morphology tags to TL morphology tags, respectively. In the target language, a generation model was used to transform lemmas and morphology tags into word forms. The results reported by Koehn and Hoang (2007) strongly indicate that this translation approach is not sufficient on its own; instead, the decomposed translation approach should be combined with a standard word form translation model so that one can be used in those cases where the other fails. This configuration was therefore adopted for our experiments.
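Conceptually, the combination can be pictured as a back-off scheme like the one below. This is a deliberately simplified, word-by-word illustration of the idea rather than the factored decoding paths actually configured in Moses, where both paths are full phrase-based models scored jointly in the log-linear framework; all dictionaries in the sketch are stand-ins.

def analytical_backoff(tokens, word_table, lemma_table, morph_table, gen_table):
    """Translate a factored sentence token by token.

    Each token is a (word, lemma, morph) triple; the *_table arguments are
    plain dictionaries standing in for the translation and generation models."""
    output = []
    for word, lemma, morph in tokens:
        if word in word_table:                       # primary path: word-form translation
            output.append(word_table[word])
        elif lemma in lemma_table and morph in morph_table:
            tl_lemma = lemma_table[lemma]            # back-off path: translate lemma
            tl_morph = morph_table[morph]            # and morphology separately, ...
            output.append(gen_table.get((tl_lemma, tl_morph), word))   # ... then generate the word form
        else:
            output.append(word)                      # last resort: copy the source word
    return ' '.join(output)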
The analytical translation approach fails to achieve any significant score improvement with the full parallel corpus. Closer examination of the MT output reveals that the strategy of using lemmas and morphological information to translate unknown word forms works in principle, as shown by the following example:

Input: Molly har visat mig bröllopsfotona.
Reference: Molly har vist mig fotoene fra brylluppet.
Baseline: Molly har vist mig bröllopsfotona.
Analytical: Molly har vist mig bryllupsbillederne.

In this sentence, there can be no doubt that the output produced by the analytical system is superior to that of the baseline system. Where the baseline system copied the Swedish word bröllopsfotona 'wedding photos' literally into the Danish text, the translation found by the analytical model, bryllupsbillederne 'wedding pictures', is both semantically and syntactically flawless. Unfortunately, the reference translation uses different words, so the evaluation scores will not reflect this improvement.

The lack of success of analytical translation in terms of evaluation scores can be ascribed to at least three factors: Firstly, there are relatively few vocabulary gaps in our data, which is due to the size of the training corpus. Only 1.19 % (1,311 of 109,823) of the input tokens are tagged as unknown by the decoder in the baseline system. As a result, there is not much room for improvement with an approach specifically designed to handle vocabulary coverage, especially if this approach itself fails in some of the cases missed by the baseline system: Analytical translation brings this figure down to 0.88 % (970 tokens), but no further. Secondly, employing generation tables trained on the same corpus as the translation tables used by the system limits the attainable gains from the outset, since a required word form that is not found in the translation table is likely to be missing from the generation table, too. Thirdly, in case of vocabulary gaps in the translation tables, chances are that the system will not be able to produce the optimal translation for the input sentence.
Instead, an approach like analytical translation aims to find the best translation that can be derived from the available models, which is certainly a reasonable thing to do. However, when only one reference translation is used, current evaluation methods will not allow alternative solutions, uniformly penalising all deviating translations instead. While using more reference translations could potentially alleviate this problem, multiple references are expensive to produce and just not available in many situations. Consequently, there is a systematic bias against the kind of solutions analytical translation can provide: Often, the evaluation method will assign the same scores to untranslated gibberish as to valid attempts at translating an unknown word with the best means available.

6 Experiments with reduced corpora

We tested SMT systems trained on reduced corpora in two experimental conditions. In the symmetric condition, the systems described in the previous section were trained on a parallel corpus of 9,000 subtitles, or around 100,000 tokens per language, only. This made it possible to study the behaviour of the systems with little data. In the asymmetric condition, the small 9,000 subtitle parallel corpus was used to train the translation models and lexical reordering models. The generation and language models, which only rely on monolingual data in the target language, were trained on the full 900,000 subtitle dataset in this condition. This setup simulates a situation in which it is difficult to find parallel data for a certain language pair, but monolingual data in the target language can be more easily obtained. This is not unlikely when translating from a language with few electronic resources into a language like English, for which large amounts of corpus data are readily available.

The results of the experiments with reduced corpora follow a more interesting pattern. First of all, it should be noted that the experiments in the asymmetric condition consistently outperformed those in the symmetric condition. Evidently, Statistical Machine Translation benefits from additional data, even if it is only available in the target language.

The training corpus of 9,000 segments or 100,000 tokens per language used in the symmetric experiments is extremely small for SMT; in comparison to the training sets used in most other studies, this set is tiny. Consequently, one would expect the translation quality to be severely impaired by data sparseness issues, making it difficult for the Machine Translation system to handle unseen data. This prediction is supported by the experiments: The scores are improved by all extensions that allow the model to deal with more abstract representations of the data and thus to generalise more easily. The highest gains in terms of BLEU and NIST scores result from the morphology language model, which helps to ensure that the TL sentences produced by the system are well-formed.

Interestingly enough, the relative performance of the lexical reordering models runs contrary to the findings obtained with the full corpus. Lexical reordering models turn out to be helpful when conditioned on lemmas or POS tags, whereas lexical reordering conditioned on word forms neither helps nor hurts. This is probably due to the fact that it is more difficult to gather satisfactory information about reordering from the small corpus. The reordering probabilities can be estimated more reliably after abstracting to lemmas or POS tags.

In the asymmetric condition, the same phrase tables and lexical reorderings as in the symmetric condition were used, but the generation tables and language models were trained on a TL corpus 100 times as large. The benefit of this larger corpus is obvious already in the baseline experiment, which is completely identical to the baseline experiment of the symmetric condition except for the language model. Clearly, using additional monolingual TL data for language modelling is an easy and effective way to improve an SMT system.

Furthermore, the availability of a larger data set on the TL side brings about profound changes in the relative performance of the individual systems with respect to each other. The POS language model, which proved useful in the symmetric condition, is detrimental now. The morphology language model does improve the BLEU score, but only by a very small amount, and the effect on the NIST score is slightly negative. This indicates that the language model operating on word forms is superior to the abstract models when it is trained on sufficient data. Likewise, all three lexical reordering models hurt performance in the presence of a strong word form language model. Apparently, when the language model is good, nothing can be gained by having a doubtful reordering model trained on insufficient data compete against it.
The most striking result in the asymmetric condition, however, is the score of the analytical translation model, which achieved an impressive improvement of 1.9 percentage points in the BLEU score along with an equally noticeable increase of the NIST score. In the asymmetric setup, where the generation model has much better vocabulary coverage than the phrase tables, analytical translation realises its full potential and enables the SMT system to produce word forms it could not otherwise have found.

In sum, enlarging the size of the target language corpus resulted in a gain of 2.7 percentage points BLEU on the baseline score of the symmetric condition, which is entirely due to the better language model on word forms and can be realised without linguistic analysis of the input. By integrating morphological analysis and lemmas for both the SL and the TL part of the corpus, the leverage of the additional data can be increased even further by analytical translation, realising another improvement of 1.9 percentage points, totalling 4.6 percentage points over the initial baseline.

7 Conclusion

Subject to a set of peculiar practical constraints, the text genre of film subtitles is characterised by short sentences with a comparatively simple structure and frequent reuse of similar expressions. Moreover, film subtitles are a text genre designed for translation; they are translated between many different languages in huge numbers. Their structural properties and the availability of large amounts of data make them ideal for Statistical Machine Translation. The present report investigates the potential of incorporating information from linguistic analysis into the Swedish–Danish phrase-based SMT system for film subtitles presented by Volk and Harder (2007). It is based on a subset of the data used by Volk and Harder, which has been extended with linguistic annotations in the Constraint Grammar framework produced by the DanGram parser (Bick, 2001). We integrated the annotations into the SMT system using the factored approach to SMT (Koehn and Hoang, 2007) as offered by the Moses decoder (Koehn et al., 2007) and explored the opportunities offered by factored SMT with a number of experiments, each adding a single additional component into the system.

When a large training corpus of around 900,000 subtitles or 10 million tokens per language was used, the gains from adding linguistic information were generally small. Minor improvements were observed when using additional language models operating on part-of-speech tags and tags from morphological analysis. A technique called analytical translation, which enables the SMT system to back off to separate translation of lemmas and morphological tags when the main phrase table does not provide a satisfactory translation, afforded slightly improved vocabulary coverage. Lexical reordering conditioned on word forms also brought about a minor improvement, whereas conditioning lexical reordering on more abstract categories such as lemmas or POS tags had a detrimental effect.

On the whole, none of the gains was large enough to justify the cost and effort of producing the annotations. Moreover, there was a clear tendency for complex models to have a negative effect when the information employed was not selected carefully enough. When the corpus is large and its quality good, there is a danger of obstructing the statistical model from taking full advantage of the data by imposing clumsily chosen linguistic categories. Given sufficient data, enforcing manually selected categories which may not be fully appropriate for the task in question is not a promising approach. Better results could possibly be obtained if abstract categories specifically optimised for the task of modelling distributional characteristics of words were statistically induced from the corpus.

The situation is different when the corpus is small. In a series of experiments with a corpus size of only 9,000 subtitles or 100,000 tokens per language, different manners of integrating linguistic information were consistently found to be beneficial, even though the improvements obtained were small. When the corpus is not large enough to afford reliable parameter estimates for the statistical models, adding abstract data with richer statistics stands to improve the behaviour of the system. Compared to the system trained on the full corpus, the effects involve a trade-off between the reliability and usefulness of the statistical estimates and of the linguistically motivated annotation, respectively; the difference in the results stems from the fact that the quality of the statistical models strongly depends on the amount of data available, whilst the quality of the linguistic annotation is about the same regardless of corpus size.
The close relationship of Swedish and Danish may also have an impact: For language pairs with greater grammatical differences, the critical corpus size at which the linguistic annotations we worked with stop being useful may be larger.

Our most encouraging findings come from experiments in an asymmetric setting, where a very small SL corpus (9,000 subtitles) was combined with a much larger TL corpus (900,000 subtitles). A considerable improvement to the score was realised just by adding a language model trained on the larger corpus, which does not yet involve any linguistic annotations. With the help of analytical translation, however, the annotations could be successfully exploited to yield a further gain of almost 2 percentage points in the BLEU score. Unlike the somewhat dubious improvements in the other two conditions, this is clearly worth the effort, and it demonstrates that factored Statistical Machine Translation can be successfully used to improve translation quality by integrating additional monolingual data with linguistic annotations into an SMT system.

References

Daniel Becquemont. 1996. Le sous-titrage cinématographique : contraintes, sens, servitudes. In Yves Gambier, editor, Les transferts linguistiques dans les médias audiovisuels, pages 145–155. Presses universitaires du Septentrion, Villeneuve d'Ascq.

Eckhard Bick. 2001. En Constraint Grammar parser for dansk. In 8. Møde om udforskningen af dansk sprog, pages 40–50, Århus.

Ondřej Bojar. 2007. English-to-Czech factored Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 232–239, Prague.

Paul R. Cohen. 1995. Empirical methods for Artificial Intelligence. MIT Press, Cambridge (Mass.).

Jorge Díaz-Cintas and Aline Remael. 2007. Audiovisual Translation: Subtitling, volume 11 of Translation Practices Explained. St. Jerome Publishing, Manchester.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second International conference on Human Language Technology Research, pages 138–145, San Diego.

Fred Karlsson. 1990. Constraint Grammar as a framework for parsing running text. In COLING-90. Papers presented to the 13th International conference on Computational Linguistics, pages 168–173, Helsinki.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Conference on empirical methods in Natural Language Processing, pages 868–876, Prague.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.

Philipp Koehn, Amittai Axelrod, et al. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In International workshop on spoken language translation, Pittsburgh.

Philipp Koehn, Hieu Hoang, et al. 2007. Moses: open source toolkit for statistical machine translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague.

Franz Josef Och. 2003. Minimum error rate training in Statistical Machine Translation. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, pages 160–167, Sapporo (Japan).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia. ACL.

Jan Pedersen. 2007. Scandinavian subtitles. A comparative study of subtitling norms in Sweden and Denmark with a focus on extralinguistic cultural references. Ph.D. thesis, Stockholm University, Department of English.

Andreas Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver (Colorado).

Teresa Tomaszkiewicz. 1993. Les opérations linguistiques qui sous-tendent le processus de sous-titrage des films. Wydawnictwo Naukowe UAM, Poznań.

Martin Volk and Søren Harder. 2007. Evaluating MT with translations or translators. What is the difference? In Proceedings of MT Summit XI, pages 499–506, Copenhagen.