A Hierarchical Disambiguation Model in Machine Translation

Author: 牛悦
Chapter 1: Introduction
Ambiguity is a very common phenomenon in English. Most English words are
polysemous. For example, the word palm means either a kind of tree or a part of the
hand. Such polysemous words and expressions may trigger lexical ambiguities. The
task of assigning an appropriate meaning to these ambiguous words according to the
given context is called Word Sense Disambiguation (WSD for short). WSD is vital to
machine translation.
Humans can readily recognize the appropriate meaning of a word or
expression. For a machine, however, the task is much more complicated and difficult.
Furthermore, resolving lexical ambiguities requires many knowledge resources,
such as linguistic information, contextual (pragmatic) information and cultural
information. Owing to the limitations of current computer technology, most
contextual and cultural knowledge still cannot be represented, and the WSD
problem in machine translation remains unsolved.
With the development of related sciences (such as linguistics and computer
science), many WSD methods have been proposed to resolve lexical ambiguities.
Research on several commonly used methods shows that none of them
is capable of resolving all kinds of ambiguity; each has its own
disambiguation reference and application range. For example, the WordNet-based method
mainly deals with nouns and takes context nouns as its disambiguation reference. The
method based on co-occurrence features resolves polysemous words by means of
co-occurrence words, and the method based on selectional restriction can only be used
when a verb or modifier strongly restricts the meaning of the noun. Although these
methods resolve the ambiguities they target with high accuracy in evaluations, each
treats only limited kinds of ambiguity and has its own shortcomings. If any one of them
is used to disambiguate all kinds of polysemous words in a text, the overall accuracy is
very low. In order to obtain optimal disambiguation efficiency for a text, this thesis
proposes a hierarchical disambiguation model which combines six methods with
different disambiguation references and application ranges.
The hierarchical disambiguation model proposed in this thesis comprises six
steps: looking up idiom and collocation dictionaries, tagging parts of speech,
disambiguating nouns with context nouns, disambiguating verbs and nouns with
co-occurrence words, resolving ambiguous words whose meaning is strongly
restricted by selectional restrictions, and assigning the most frequently used sense to
words without any disambiguation reference.
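The cascade idea behind these steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the model's actual implementation: it shows three of the six steps, and the dictionaries, cue words, sense labels and function names (`idiom_step`, `cooccurrence_step`, `most_frequent_step`, `run_pipeline`) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A step looks at the target word and its context and returns a sense label,
# or None when it has no disambiguation reference to go on.
Step = Callable[[str, List[str]], Optional[str]]


def idiom_step(word: str, context: List[str]) -> Optional[str]:
    # Hypothetical idiom dictionary with a single entry.
    if word == "hand" and "on the one hand" in " ".join(context):
        return "hand#7"
    return None


def cooccurrence_step(word: str, context: List[str]) -> Optional[str]:
    # Hypothetical cue words mapped to hypothetical sense labels.
    cues = {"bass": {"guitar": "bass (music)", "river": "bass (fish)"}}
    for token in context:
        sense = cues.get(word, {}).get(token)
        if sense is not None:
            return sense
    return None


def most_frequent_step(word: str, context: List[str]) -> Optional[str]:
    # Last resort: hypothetical table of most frequently used senses.
    return {"plant": "plant#1", "bass": "bass#1"}.get(word)


# Steps are tried in order; more reliable references come first.
PIPELINE: List[Tuple[str, Step]] = [
    ("idiom", idiom_step),
    ("co-occurrence", cooccurrence_step),
    ("most frequent", most_frequent_step),
]


def run_pipeline(word: str, context: List[str]) -> Tuple[str, Optional[str]]:
    """Return the name of the first step that resolves the word, and its sense."""
    for name, step in PIPELINE:
        sense = step(word, context)
        if sense is not None:
            return name, sense
    return "unresolved", None
```

For instance, `run_pipeline("plant", "he works in the plant".split())` falls through the first two steps and is resolved only by the most-frequent-sense step, which is the fallback behaviour the last of the six steps describes.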
Evaluation results show that the WSD efficiency of the hierarchical
disambiguation model is much higher than that of any method working alone.
Chapter 2: An Overview of Current Word Sense Disambiguation Methods
Many disambiguation methods have been proposed since the 1950s. Among
them, idiom dictionary and compound words dictionary look-up, part-of-speech
tagging, the WordNet-based method, the method based on co-occurrence features, the
method based on selectional restriction and the most-frequent-sense method are popular
and are briefly introduced below.
Because they draw on different knowledge sources, each method has its own application
range and disambiguation reference. The application range is the kind of ambiguity
to which the method is applicable. The disambiguation reference is the set of
surrounding words used as semantic reference to determine the appropriate
meaning of the target word. Appropriate methods for a target text can be chosen based
on the type of the text and on the application range and disambiguation reference of
each method.
§2.1 Idiom Dictionary and Compound Words Dictionary Look-up
The meaning of a polysemous word in a fixed collocation can be determined
directly. For example, the word hand in the expression on the one hand means hand#7
(one of two sides of an issue) and the compound word black tea is translated as
红茶 in Chinese. Idiom and compound-word dictionaries are necessary for dealing
with fixed expressions in natural languages. If the meanings of the words in a fixed
collocation can be found in these dictionaries, other analyses can be avoided.
For some expressions which can be used either as idioms or as ordinary phrases,
dictionary look-up may assign the wrong meaning. Take the phrase black sheep
for example. In the sentence ‘We all thought my youngest brother was the black sheep
in our family (我们都将我的小弟弟视为我们家的败家子)’, the phrase black sheep
acts as an idiom. In the song ‘Baa, baa, black sheep, Have you any wool? Yes, sir, yes,
sir, Three bags full; One for the master, And one for the dame, And one for the little
boy Who lives down the lane (咩咩,黑羊,你有羊毛吗?是的先生,是的先生,
满三袋:一袋给主人,一袋给夫人,还有一袋给住在巷尾的小男孩)’, black sheep
is just an ordinary noun phrase. Deciding how to read expressions like black sheep
requires other reference information and other WSD methods.
§2.2 Part-of-speech Tagging
Part-of-speech (PoS) tagging is the process by which a tagging system labels
each word in the text with its part of speech. Some words have different meanings
depending on their parts of speech. Such words, called concurrences here, have two
or more grammatical functions (and hence parts of speech and meanings) in different
contexts. If the part of speech of such a word can be determined, its appropriate
meaning can be determined as well. In other words, PoS tagging can resolve some
lexical ambiguities triggered by concurrences. For example, the word will as a noun
means an intention or wish, while as an auxiliary verb it indicates futurity:
a) He will meet his mother at the airport tomorrow.
b) His will makes him successful.
However, if a word with a given part of speech is still polysemous, PoS tagging
cannot disambiguate it. For example, the word plant has two commonly used
meanings as a noun: plant life and industrial works. For the sentence
‘He has worked in the plant for 20 years’, even if the parser tags plant as a noun,
the meaning of the word still cannot be assigned, because both noun meanings
remain possible.
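The reach and the limit of PoS tagging as a disambiguation step can be sketched as follows. This is a minimal illustration, assuming the tag has already been produced by some tagger; the tag names `NOUN` and `AUX`, the small sense table and the function name `resolved_by_pos` are hypothetical, with the sense descriptions taken from the examples above.

```python
from typing import Dict, List, Optional, Tuple

# Senses indexed by (word, part of speech); a toy table built from the
# examples in this section, not a real lexicon.
SENSES: Dict[Tuple[str, str], List[str]] = {
    ("will", "NOUN"): ["an intention or wish"],
    ("will", "AUX"): ["indicates futurity"],
    ("plant", "NOUN"): ["plant life", "industrial works"],
}


def resolved_by_pos(word: str, pos: str) -> Optional[str]:
    """Return the sense if the PoS tag leaves exactly one candidate, else None."""
    candidates = SENSES.get((word, pos), [])
    if len(candidates) == 1:
        return candidates[0]
    # Zero or several candidates: PoS tagging alone cannot decide.
    return None
```

With this table, `resolved_by_pos("will", "AUX")` yields a single sense, while `resolved_by_pos("plant", "NOUN")` returns `None`, so the word has to be passed on to another method.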
§2.3 WordNet-based Method
The WordNet-based method is a knowledge-based method relying on
machine-readable dictionaries, which provide word senses and the relations between them.
Lesk (1986) first built a knowledge base to count the content words shared by
the sense definitions of the ambiguous word and those of the context words
occurring nearby, and to select the sense of the target word whose signature contains
the greatest number of overlaps. Lesk took the word cone in the phrase pine cone as
an example to illustrate the method. The appropriate sense of the word cone in the
phrase is chosen from its three senses by comparing the definitions of pine and cone.
pine: 1 kinds of evergreen tree with needle-shaped leaves
      2 waste away through sorrow or illness
cone: 1 solid body which narrows to a point
      2 something of this shape whether solid or hollow
      3 fruit of certain evergreen trees
In this example, sense 1 of pine has two content words overlapping the content
words in sense 3 of cone: evergreen and tree. Hence, if the two words occur in the
same context, the sense of cone can be determined as fruit of certain evergreen trees.
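The overlap count in the pine cone example can be reproduced with a short Python sketch. It is a simplified illustration of the definition-overlap idea rather than Lesk's original program: the stop-word list and the naive plural stripping (so that tree and trees match) are assumptions made here, and the function names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# Function words ignored when comparing definitions (an assumed, minimal list).
STOPWORDS = {"a", "of", "or", "to", "this", "with", "which", "through", "whether"}

PINE: Dict[int, str] = {
    1: "kinds of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
CONE: Dict[int, str] = {
    1: "solid body which narrows to a point",
    2: "something of this shape whether solid or hollow",
    3: "fruit of certain evergreen trees",
}


def content_words(definition: str) -> Set[str]:
    """Lower-case the definition, drop stop words and strip a plural 's'."""
    words = set()
    for token in definition.lower().split():
        if token in STOPWORDS:
            continue
        if token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]  # naive normalisation: trees -> tree
        words.add(token)
    return words


def simplified_lesk(
    target: Dict[int, str], context: Dict[int, str]
) -> Tuple[Optional[int], Set[str]]:
    """Pick the target sense whose definition shares most words with any context sense."""
    best_sense, best_overlap = None, set()
    for number, definition in target.items():
        for context_definition in context.values():
            overlap = content_words(definition) & content_words(context_definition)
            if len(overlap) > len(best_overlap):
                best_sense, best_overlap = number, overlap
    return best_sense, best_overlap
```

Calling `simplified_lesk(CONE, PINE)` selects sense 3 of cone, with evergreen and tree as the overlapping content words.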
Since Lesk (1986), many researchers have used machine-readable dictionaries as a
knowledge source for WSD. Agirre (1996) presented a disambiguation method which
uses the noun taxonomy provided by WordNet to resolve the lexical ambiguity of
nouns.
WordNet, developed at the Princeton Cognitive Science Laboratory, is a large,
freely available lexical database which takes a hybrid approach to identifying, defining
and organizing word senses. Word senses in WordNet are defined as synsets. In WordNet,
words are represented by their definitions, synonyms, antonyms, hypernyms
(superordinates), hyponyms (subordinates), coordinate terms and meronyms (parts).
WordNet has more than 118,000 word forms and about 90,000 synsets.
WordNet offers various kinds of information, such as synonyms, hypernyms/hyponyms,
meronyms, derivationally related terms, domain terms and familiarity, which can be
used to resolve ambiguity separately or in sequence, depending on the type of text and
the requirements of the WSD model. In this way, the program can deal with ambiguities
using a large amount of disambiguation information from a freely available database
rather than from a tagged corpus, which is time-consuming and labor-intensive to build.
Take the noun plant again as an example.
a) Definitions of plant:
Plant 1, works, industrial plant -- (buildings for carrying
on industrial labor; "they built a large plant to manufacture
automobiles")
Plant 2, flora, plant life -- (a living organism lacking the
power of locomotion)
Plant 3 -- (something planted secretly for discovery by
another; "the police used a plant to trick the thieves"; "he
claimed that the evidence against him was a plant")
Plant 4 -- (an actor situated in the audience whose acting is
rehearsed but seems spontaneous to the audience)
WordNet is freely available at http://wordnet.princeton.edu/online/.
b) Hypernym synsets of plant
plant#1            plant#2      plant#3                 plant#4
building complex   life form    contrivance             actor
structure          entity       scheme                  performer
artifact                        plan of action          entertainer
object                          plan                    person
entity                          idea                    life form
                                content                 entity
                                cognition
                                psychological feature
c) Synonyms of plant
4 senses of plant
Sense 1
plant, works, industrial plant -- (buildings for carrying on
industrial labor; "they built a large plant to manufacture
automobiles")
=> building complex, complex -- (a whole
structure (as a building) made up of interconnected or related
structures)
Sense 2
plant, flora, plant life -- (a living organism lacking the
power of locomotion)
=> organism, being -- (a living thing that has (or
can develop) the ability to act or function independently)
Sense 3
plant -- (something planted secretly for discovery by
another; "the police used a plant to trick the thieves"; "he
claimed that the evidence against him was a plant")
=> contrivance, stratagem, dodge -- (an
elaborate or deceitful scheme contrived to deceive or evade; "his
testimony was just a contrivance to throw us off the track")
Sense 4
plant -- (an actor situated in the audience whose acting is
rehearsed but seems spontaneous to the audience)
=> actor, histrion, player, thespian, role player --
(a theatrical performer)
d) Meronyms (parts) of plant
1 of 4 senses of plant
Sense 2
plant, flora, plant life -- (a living organism lacking the
power of locomotion)
HAS PART: plant part, plant structure -- (any
part of a plant or fungus)
………………………………..
Among all the kinds of information WordNet provides, the synset is the most commonly
used. A synset can be represented by more than one word (its synonyms), and the
same word may belong to more than one synset, one for each sense (see the word plant).
WordNet is built around the concept of the synset and represents both semantic
relationships and lexical relationships. It consists of two parts. The first part is a
list of words with their lexical relationships. The second part is a set of semantic
relations between synsets, such as is-a relations (rat is-a rodent), part-of relations
(door part-of house), substance-of, member-of and other relations.
The WordNet-based method proposed by Agirre (1996) disambiguates target nouns
on the assumption that the more similar two words are, the more likely they are
to share semantic information in WordNet.
That is, if some information is shared by several concepts or words, these words or
concepts can be gathered into a taxonomy. Accordingly, if the target word occurs with
several context nouns which share the same information, the meaning of the target noun
can be determined from these semantically related context nouns.
In this method, the parser first extracts the nouns $W = \{W_1, W_2, \ldots, W_n\}$
appearing in a context. Each word $W_i$ is looked up in WordNet and a set of senses
$S_i = \{S_{1i}, S_{2i}, \ldots\}$ is obtained. Each sense has a set of concepts in the
hypernymy/hyponymy relations, and these concepts are the marks used to resolve the
ambiguity of the noun. Once the parser has extracted all the context nouns and their
hypernym synsets from WordNet, the program assigns the common concept (e.g.
entity) as a disambiguation mark to all the senses of the nouns appearing in the context.
If this concept cannot resolve the ambiguity, the method goes through the hypernym
synsets in WordNet hierarchically and assigns new marks. During the process,
for each mark, the number of concepts contained in its subhierarchy is counted. The
sense with the highest number of words located under a given mark is then chosen as the
probable sense of the noun (Agirre, 1996).
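The intuition of scoring senses by the concepts they share with context nouns can be illustrated with a deliberately simplified Python sketch. It is not Agirre's actual measure: it merely counts the hypernym concepts, excluding the root entity, that each sense of plant shares with the context nouns. The chains for plant follow the hypernym listing given earlier in this chapter, while the context nouns factory and worker, their chains and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypernym chains of the four senses of plant, as listed earlier in the chapter.
PLANT_SENSES: Dict[str, List[str]] = {
    "plant#1": ["building complex", "structure", "artifact", "object", "entity"],
    "plant#2": ["life form", "entity"],
    "plant#3": ["contrivance", "scheme", "plan of action", "plan", "idea",
                "content", "cognition", "psychological feature"],
    "plant#4": ["actor", "performer", "entertainer", "person", "life form", "entity"],
}

# Hypothetical context nouns, each with the hypernym chains of its senses.
CONTEXT_NOUNS: Dict[str, List[List[str]]] = {
    "factory": [["building complex", "structure", "artifact", "object", "entity"]],
    "worker": [["person", "life form", "entity"]],
}

ROOT = "entity"  # shared by almost everything, so it cannot discriminate


def score_sense(chain: List[str], context: Dict[str, List[List[str]]]) -> int:
    """Sum, over context nouns, the largest number of non-root concepts shared."""
    total = 0
    for sense_chains in context.values():
        total += max(len((set(chain) & set(c)) - {ROOT}) for c in sense_chains)
    return total


def choose_sense(
    target: Dict[str, List[str]], context: Dict[str, List[List[str]]]
) -> Tuple[str, Dict[str, int]]:
    """Return the best-scoring sense together with all the scores."""
    scores = {sense: score_sense(chain, context) for sense, chain in target.items()}
    return max(scores, key=scores.get), scores
```

With these toy chains, `choose_sense(PLANT_SENSES, CONTEXT_NOUNS)` prefers plant#1, because it shares four concepts with factory, whereas plant#4 shares only two with worker.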
The WordNet-based method is applied to nouns by examining their context nouns,
i.e. the list of nouns occurring in the context of the target noun. These context
nouns serve as the disambiguation reference: the meaning of the target noun is
determined according to its conceptual similarity to them in WordNet.
Unlike Lesk’s definition-based method, the WordNet-based method uses various kinds
of information in WordNet besides definitions: hypernymy/hyponymy relations,
coordinate terms and glosses. Even when the definitions of the target words are
not long enough to resolve the ambiguity, the other information provided by WordNet
can be used for disambiguation.
The WordNet-based method has its shortcomings. Its application range and
disambiguation reference are limited to nouns, so it cannot resolve lexical ambiguities
triggered by other kinds of words. Like other dictionaries, WordNet suffers from data
sparseness and cannot provide all the disambiguation information for target nouns
that the method requires.
§2.4 Method Based on Co-occurrence Features
Co-occurrence features are pieces of information in the context of the target word
that help determine its meaning, including the co-occurring words themselves, word
stems, parts of speech and word positions. They serve as effective disambiguation
information for lexical ambiguities and are used in many corpus-based methods. This
method determines the probable meaning of a target word by checking for its
co-occurrence words in the context (Feng, 2004). For
example:
An electric guitar and bass player stand off to one side, not really part of the
scene, just as a sort of nod to gringo expectations perhaps.
The top 10 co-occurrence words of bass in the BNC are water (34382), big (24853),
play (21119), sound (13821), music (14795), fish (10748), river (9169), eat (7281) and
player (5606). If bass occurs in a context with some of these words, the meaning of the
word can be determined. In this sentence the word bass and the four words around it
are taken as the disambiguation window: guitar and bass player stand. Of the ten
co-occurrence words, only guitar and player occur in the window, so the sense of bass
can be identified as the member with the lowest range of a family of musical
instruments.
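The window check described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the grouping of the co-occurrence words into a musical sense and a fish sense is an assumption made here (the BNC list itself does not label senses), and the sense labels and function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Co-occurrence words from the example, grouped by sense as an assumption.
SENSE_CUES: Dict[str, Set[str]] = {
    "bass (musical instrument)": {"play", "sound", "music", "player", "guitar"},
    "bass (fish)": {"water", "fish", "river", "eat"},
}


def window(tokens: List[str], target: str, size: int = 2) -> List[str]:
    """Return up to `size` words on each side of the first occurrence of target."""
    i = tokens.index(target)
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]


def sense_by_cooccurrence(
    tokens: List[str], target: str, size: int = 2
) -> Optional[str]:
    """Pick the sense with the most cue words in the window, or None if none occur."""
    neighbours = set(window(tokens, target, size))
    hits = {sense: len(cues & neighbours) for sense, cues in SENSE_CUES.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


tokens = "an electric guitar and bass player stand off to one side".split()
```

Here `window(tokens, "bass")` gives guitar, and, player, stand, so `sense_by_cooccurrence(tokens, "bass")` selects the musical sense. When no cue word falls inside the window the function returns `None`, which corresponds to the case where the method has no disambiguation reference.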
Co-occurrence words can be obtained from many sources, such as corpora or
dictionaries. Most co-occurrence words of a target word are obtained from a
tagged corpus (supervised methods) or an untagged corpus (unsupervised methods)
rather than from a machine-readable dictionary. Many corpora are available online,
e.g. the BNC, the ANC (American National Corpus), SemCor and Sinica (a tagged corpus
for the study of Chinese). These corpora provide a large amount of sense-tagged data
and information for the methods. Co-occurrence words can also be obtained from
machine-readable dictionaries such as WordNet. WordNet can provide coordinate
terms for many nouns and verbs, synonyms/related nouns/antonyms for adjectives and
synonyms/stem adjectives for adverbs. Other dictionaries, e.g. LDOCE, can also
provide this kind of information, which can be used to obtain sense-related words for
the target words.
Although the method based on co-occurrence features, like the WordNet-based method,
mainly resolves the lexical ambiguities of nouns, its application range covers verbs
as well. For example, if the verb play co-occurs with the word games or football, its
correct sense is determined as participate in games or sport, and if it co-occurs with
the words drum and piano, the sense is play on an instrument. Nouns, adjectives,
prepositions and other kinds of words can be used as disambiguation references. Since
many words have no co-occurrence words in a given context, and some co-occurrence
words may not be semantically related to the target word, the application range of the
method based on co-occurrence features is limited. Take the sentence ‘She went into the
new plant with an apple’ for example. The word apple is a co-occurrence word of plant
and would be used to identify plant as plant#2 (a living organism lacking the power of
locomotion), but plant in this sentence is plant#1 (buildings for carrying on industrial
labor). If the method based on co-occurrence features is used, the sentence will be
mistranslated as ‘她带着一个苹果进了新的植物’.
BNC is available at www.natcorp.ox.ac.uk/.
ANC is available at www.anc.org.za/.
SemCor is available at http://www.cs.unt.edu/~rada/downloads.html#semcor.
Sinica is available at http://www.sinica.edu.tw/ftms-bin/kiwi1/pkiwi.sh.
Senses of words in this thesis are from WordNet.
§2.5 Method Based on Selectional Restriction
Selectional restriction was first applied to natural language processing by Hirst
(1987). It resolves ambiguities mainly by relying on the semantic features, semantic
relationships and collocations of words.
Semantic features provide semantic constraints (selectional restrictions)
which decide whether a word with certain semantic features can occur in a given
collocation or phrase. They are therefore important for word-level
disambiguation, because incompatible combinations of feature sets can be rejected.
Transfer algorithms can use these features to identify the correct meaning of a
word and choose its equivalent in the target language. For example, the verb drink
requires its AGENT to be ANIMATE and its PATIENT to be LIQUID (CONCRETE rather than
ABSTRACT), such as beer, coffee or a soft drink. Hence, only a noun with the feature
LIQUID can be accepted as the patient. The word drink is described as “cat=verb,
AGENT=HUMAN, PATIENT=LIQUID” in the dictionary. For example:
a. John drank day and night.
b. There was no liquid drunk at the meeting.
In sentence a, the parser will identify day and night not as the patient of drank
but as an adverbial of the sentence. In sentence b, the phrase liquid drunk
will not be interpreted as a noun phrase, because the word liquid is not an animate
agent of drunk.
Like the word drink, the word eat also requires an ANIMATE AGENT and an
EDIBLE PATIENT, which is also concrete, such as cake or apple. For example,
John ate the game.
The word game has two different meanings: play or sport and a kind of wild
animal hunted or fished for food. It is clear that the patient of the word eat must be
edible. In this case, the meaning play or sport is rejected and the appropriate
meaning of the word game in this sentence is identified as a kind of wild animals
Plant#2 refers to the second sense of plant in WordNet.
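The selectional-restriction check described in this section can be sketched as a feature lookup. This is a minimal illustration under stated assumptions: the verb frames follow the AGENT/PATIENT requirements given for drink and eat, while the feature sets attached to the noun senses and the function name `senses_allowed_as_patient` are hypothetical.

```python
from typing import Dict, List, Set

# Required features for each verb's roles, following the text.
VERB_FRAMES: Dict[str, Dict[str, str]] = {
    "drink": {"AGENT": "ANIMATE", "PATIENT": "LIQUID"},
    "eat": {"AGENT": "ANIMATE", "PATIENT": "EDIBLE"},
}

# Noun senses with assumed semantic features (illustrative only).
NOUN_SENSES: Dict[str, Dict[str, Set[str]]] = {
    "game": {
        "play or sport": {"ABSTRACT"},
        "wild animal hunted or fished for food": {"CONCRETE", "EDIBLE"},
    },
    "beer": {"beer": {"CONCRETE", "LIQUID"}},
}


def senses_allowed_as_patient(verb: str, noun: str) -> List[str]:
    """Keep only the noun senses that carry the feature the verb demands of its patient."""
    required = VERB_FRAMES[verb]["PATIENT"]
    return [
        sense
        for sense, features in NOUN_SENSES[noun].items()
        if required in features
    ]
```

For John ate the game, `senses_allowed_as_patient("eat", "game")` keeps only the wild-animal sense, and an empty result, as for drink with game, signals that the restriction rejects every sense and another method is needed.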