2014-05-06
Introduction to linguistics
Lecture 11: Computational linguistics
Sources
• Fromkin, Victoria, Robert Rodman, and Nina Hyams. 2003. An introduction to language.
– Chapter 9: Humans and Computers.
• (a collection of various corpora)
Language and computers
• Computational linguistics (CL) – a subfield of linguistics and computer science.
• It deals with the interaction of human language and computers.
• It overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition.
The scope of CL
• CL covers the use of computers for:
– the analysis of written text and spoken discourse;
– the translation of text and speech from one language to another;
– the use of human languages for communication between computers and people;
– the modelling and testing of linguistic theories.
Text analysis
• Computers are very helpful in handling texts. They allow us to:
– manipulate data easily and rapidly (searching, sorting, etc.);
– process data accurately and consistently;
– automatically annotate data, i.e. add notes to a text.
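The three operations above (searching, sorting, annotating) can be sketched in a few lines of Python. This is only a minimal illustration: the sample text, the function names and the tiny tag dictionary are all invented here and come from no particular tool.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration.
TEXT = "The cat sat on the mat. The dog sat on the log."

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def search(tokens, word):
    """Return the positions at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def annotate(tokens, notes):
    """Attach a note to every token that has one, e.g. 'cat/NOUN'."""
    return [f"{tok}/{notes[tok]}" if tok in notes else tok for tok in tokens]

tokens = tokenize(TEXT)
positions = search(tokens, "sat")               # searching
alphabetical = sorted(set(tokens))              # sorting (word list)
by_frequency = Counter(tokens).most_common(3)   # sorting (by frequency)
# Annotation with a hand-made, hypothetical tag dictionary.
annotated = annotate(tokens, {"cat": "NOUN", "dog": "NOUN", "sat": "VERB"})

print(positions)
print(alphabetical)
print(annotated)
```

The same steps give the same result every time they are run, which is the sense in which the computer processes data consistently.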
Corpora
• CORPUS (from Latin corpus, 'body'; pl. corpora) traditionally indicates a collection of texts, esp. complete and self-contained, e.g. The Corpus of Anglo-Saxon Verse.
• In linguistics and lexicography – a body of texts, utterances, or other specimens considered more or less representative of a language and stored as an electronic database.
Corpora
• Some purposes that corpora serve:
– collection of examples for linguists;
– data resource for lexicographers;
– instruction material for language teachers and learners;
– training material for natural language processing (NLP).
• Applications of corpora:
– training of speech recognizers;
– training of statistical part-of-speech taggers and parsers;
– training of example-based and statistical machine translation systems.
Corpus linguistics
• Corpus linguistics – the study of language as expressed in samples (corpora) of "real world" text.
• For linguistic purposes, corpora can help investigate such questions as:
– What is the order of different types of adjectives in English?
– With what frequency do older speakers in the Midwest use cool?
– Which do you say in English: think about or think on?
• According to Google (06.05.2014):
– think about: 4,200,000,000 results;
– think on: 3,880,000,000 results.
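A question such as "think about or think on?" comes down to counting how often each phrase occurs in a corpus. The sketch below does this for a four-sentence toy corpus. The sentences and the function name phrase_count are invented for illustration, and the resulting numbers say nothing about real English usage.

```python
import re

# Hypothetical mini-corpus, invented for illustration only.
CORPUS = [
    "I will think about it tomorrow.",
    "Think about the consequences first.",
    "She had to think on her feet.",
    "We think about language every day.",
]

def phrase_count(sentences, phrase):
    """Count whole-word, case-insensitive occurrences of `phrase`."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(s)) for s in sentences)

counts = {p: phrase_count(CORPUS, p) for p in ("think about", "think on")}
total = sum(counts.values())
# Relative frequency of each variant among all hits.
shares = {p: n / total for p, n in counts.items()}

print(counts)
print(shares)
```

With a real corpus only the CORPUS list changes. Raw hit counts from a web search engine are a much rougher measure than counts taken from a controlled corpus.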
An example of a corpus
• The BNC (British National Corpus, on-line since 1995) is a collection of about 100 million words of written and spoken language samples from various sources.
– It is designed to represent a wide cross-section of
current British English.
– Single words or phrases can be looked up:
• http://www.natcorp.ox.ac.uk/
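Looking up a word in a corpus usually returns a concordance: every hit shown with a few words of context on either side (keyword in context, KWIC). The following is a minimal sketch of that idea only. It does not use or imitate the BNC interface, and the sample sentence and the function name kwic are invented.

```python
def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample sentence, not taken from the BNC.
SAMPLE = "The weather was cool but the band was really cool that night".split()
hits = kwic(SAMPLE, "cool", width=2)

for line in hits:
    print(line)
```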
Plain corpora
• Some corpora are plain – i.e. without any information about the text:
– e.g. Project Gutenberg texts were produced by scanning.
– Then the texts were converted into a collection of public domain e-books.
– As of March 2014, Project Gutenberg claimed over 45,000 items in its collection.
• (free e-books)
• https://archive.org/details/gutenberg
Corpora and Machine Translation
• Corpora provide actual language tokens for Machine Translation programs, or are used to extract translation equivalents for them.
• Machine Translation (MT) – a subfield of computational linguistics that investigates the use of software to translate text or speech from one natural language to another.
Machine translation
• Translation is hard for a computer.
• The computer has to:
– “understand” the source text;
– convert it into the target language;
– generate a correct target text.
• The procedure looks simple but, in fact, it is a complex cognitive operation:
– Many translation problems require real-world knowledge and intuitions about the meaning of the text.
Understanding the source text
• Lexical ambiguity
– At the morphological level:
• Ambiguity of word vs stem+ending (tower, flower);
• Inflections are ambiguous (books, loaded);
• A derived form may be lexicalised (meeting, revolver).
– Lexicalisation = adding words, set phrases, or word patterns to a language.
– Grammatical category ambiguity (e.g. round).
– Homonymy:
• Alternative meanings within the same expression.
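The morphological cases can be made concrete with a naive analyser that tries both readings of a word: as a whole lexicon entry and as stem+ending. The lexicon, the suffix table and the labels below are toy assumptions invented for this sketch. A real morphological analyser is far more elaborate.

```python
# Toy lexicon and suffix table, invented for illustration only.
LEXICON = {
    "tower": {"NOUN"}, "tow": {"VERB"},
    "flower": {"NOUN"}, "flow": {"VERB"},
    "book": {"NOUN", "VERB"},
}
# Each suffix maps the category of the stem to a label for the resulting form.
SUFFIXES = {
    "er": {"VERB": "agent noun"},
    "s": {"NOUN": "plural noun", "VERB": "3rd person singular verb"},
}

def analyses(word):
    """Return every (stem, suffix, label) analysis the toy lexicon licenses."""
    # Reading 1: the word is a lexicon entry as it stands.
    found = [(word, None, cat) for cat in sorted(LEXICON.get(word, ()))]
    # Reading 2: the word is a known stem plus a known ending.
    for suffix, rules in SUFFIXES.items():
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            for cat in sorted(LEXICON[stem]):
                if cat in rules:
                    found.append((stem, suffix, rules[cat]))
    return found

for w in ("tower", "flower", "books"):
    print(w, analyses(w))
```

Each of the three words gets two competing analyses, and nothing in the word itself tells the program which one is intended.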
Understanding the source text
• Syntactic ambiguity
– Due to combination of grammatically ambiguous words, e.g.:
• Time flies like an arrow, fruit flies like a banana
– Due to alternative interpretations of structure, e.g.:
• The man saw the girl with a telescope
• In addition, there are problems resulting from differences between languages.
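Structural ambiguity can be shown by counting how many parse trees a grammar assigns to a sentence. The sketch below uses the CKY chart algorithm with a six-rule toy grammar written only for the telescope sentence. The grammar, the word list and the function name count_parses are assumptions made for illustration.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "the girl with a telescope"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "saw ... with a telescope"
    ("PP", "P", "NP"),
]
# Toy word list: word -> possible categories.
LEXICAL = {
    "the": {"Det"}, "a": {"Det"},
    "man": {"N"}, "girl": {"N"}, "telescope": {"N"},
    "saw": {"V"}, "with": {"P"},
}

def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` using a CKY chart."""
    n = len(words)
    # chart[(i, j)][A] = number of trees with root A spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICAL.get(word, ()):
            chart[(i, i + 1)][cat] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

sentence = "the man saw the girl with a telescope".split()
print(count_parses(sentence))
```

The count for this sentence is 2: in one tree the phrase "with a telescope" attaches to the verb phrase (the man used the telescope), in the other to "the girl" (the girl had it).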
Machine translation
• It is difficult to get a literary-quality translation.
• Today’s MT systems can generate rough translations that give you at least the gist of a document.
• High-quality translations are possible in specialized, narrow domains, e.g. weather forecasts.
Statistical MT
• It is virtually impossible to write an algorithm that would fully capture the grammar of a natural language.
• Rather than being given explicit rules for translating natural language, computer algorithms are trained on human-translated parallel texts;
– this allows them to automatically learn how to translate (thanks to neural networks, statistical methods, etc.).
Statistical translation programs
• Translations are generated on the basis of:
– statistical models analysing bilingual text corpora.
• E.g. Google Translate works by detecting patterns in hundreds of millions of documents that have previously been translated by humans; it then makes intelligent guesses based on the patterns it has learned.
• The more human-translated documents there are in a given language, the more likely it is that the translation will be of good quality.
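How a program can learn word correspondences from parallel text alone can be illustrated with IBM Model 1, a classic textbook model of statistical MT. The sketch below is not the method of Google Translate or of any other product. The three English-Polish sentence pairs are invented, the NULL word of the full model is left out for brevity, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical parallel corpus (English, Polish), invented for illustration.
PAIRS = [
    (["new", "house"], ["nowy", "dom"]),
    (["old", "house"], ["stary", "dom"]),
    (["new", "car"], ["nowy", "samochód"]),
]

def train_ibm1(pairs, iterations=20):
    """Estimate t[(f, e)] = P(Polish word f | English word e) with EM."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform guess for every word pair seen in the same sentence pair.
    t = {(f, e): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for f in tgt for e in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each Polish word among the English words of its sentence.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: re-estimate the probabilities from the expected counts.
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
    return t

def best_translation(t, source_word):
    """Return the Polish word with the highest probability for `source_word`."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)

table = train_ibm1(PAIRS)
for word in ("new", "old", "house", "car"):
    print(word, "->", best_translation(table, word))
```

No dictionary is given to the program. The pairing of "house" with "dom" emerges only because the two words keep turning up in the same sentence pairs, which is also why more parallel text tends to give better estimates.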
English-Polish MT programs
• Let’s compare two programs translating from English to Polish (and vice versa):
– Poltran ( ), Ectaco Inc., and
– Translatica ( ), PWN.
• The sentence to translate is as follows:
Nie można zapominać także o korzyściach społecznych i niebezpieczeństwach wynikających z wadliwego zaprojektowania budynku.
(Roughly: “One must also not forget about the social benefits and the dangers resulting from a faulty design of the building.”)
English-Polish MT programs
• Poltran: It is not possible to forget about social benefits also and from defect designing building dangers subsequent.
• Translatica: It isn't possible to forget .. also about social benefits and dangers resulting from effective designing the building.