Word Sense Disambiguation in Machine Translation




ABSTRACT

Machine translation (MT) is considered one of the ten most difficult problems in science and technology (Feng 2), and word sense disambiguation (WSD) is the most difficult issue in MT. Unless WSD is solved, the quality of MT output can hardly improve substantially.

This paper first reviews the current WSD methods and then introduces a pilot study carried out by the author. In this study, 5260 frequently used words chosen from the CET 4 and CET 6 (College English Test) vocabularies were compiled into a machine-readable dictionary. About thirty thousand sentences containing the polysemous words “bank”, “old”, “draw”, “sweet” and “since”① were downloaded from the British National Corpus and analyzed with programs written in C++. The author then designed two WSD systems, “WSD Machine” and “WSD Machine Sense Marker”, to test the hypothesis proposed in this paper.

The pilot study shows that the specific semantic rules vary from sense to sense and from word to word, and thus cannot be generalized. Most of the commonly used methods cannot solve the WSD problem in MT because they neglect the fact that, while some semantic rules can be generalized, many others are too specific to be generalized, especially the specific semantic rules of each polysemous word. Therefore, the specific semantic rules of each polysemous word must be worked out first, and appropriate methods can then be chosen to improve the efficiency of WSD. Since every WSD method has disadvantages and none is universally applicable to all kinds of ambiguity, a well-designed WSD system should be able to choose appropriate methods according to the WSD information available in the context. The most difficult task in WSD is not to invent a totally new method, but to work out the specific semantic rules of the senses of each polysemous word. No matter how innovative a method is, it cannot work without WSD data. WSD efficiency hinges on the quality of the data, and high-quality data cannot be collected before the specific semantic rules of the senses of each polysemous word have been worked out.

Although it is widely accepted that rules should be generalized for practical purposes, in WSD the specific semantic rules of polysemous words turn out to be much more important than the general rules. Since the semantic rules of polysemous words are too multifarious to be generalized, we should work out the specific semantic rules of the senses of each polysemous word and then solve the WSD problem word by word, until most of the frequently used polysemous words are well disambiguated and the quality of MT is satisfactory. This means that painstaking effort must be made in designing a WSD system. In fact, disambiguating even one sense of a polysemous word may be arduous and time-consuming work, let alone thousands of polysemous words with thousands upon thousands of senses. Therefore, WSD is no easy task, and attempting to find a universally applicable approach is to beg for frustration. Nonetheless, WSD is not a matter of impossibility but a matter of time and effort. The great effort invested in WSD will be well rewarded, because a successful MT system is bound to be very profitable.

Key Words: word sense disambiguation, machine translation, co-occurrence word, WSD Machine, WSD Machine Sense Marker

①“bank”, “old”, “draw”, “sweet” and “since” cover most of the word classes of polysemous words, i.e. noun, verb, adjective, preposition and conjunction. The WSD methods for these five words are typical, as will be illustrated later.

Contents

Abstract (Chinese)
ABSTRACT
Chapter One: Introduction
Chapter Two: Current Word Sense Disambiguation Methods
Chapter Three: A Pilot Study of Word Sense Disambiguation
    §3.1 Adaptation of the word sense disambiguation method based on co-occurrence features
        §3.1.1 The co-occurrence words of “bank”
        §3.1.2 Limitation of the method based on co-occurrence words
        §3.1.3 Other co-occurrence features
        §3.1.4 Priority of monosemantic expressions in machine translation
        §3.1.5 Compilation of a machine readable dictionary based on co-occurrence features
        §3.1.6 “WSD Machine Sense Marker” and the word sense disambiguation result
    §3.2 The key role of grammatical structures in disambiguating the senses of “since”
        §3.2.1 Incompetence of the current word sense disambiguation methods in disambiguating the senses of “since”
        §3.2.2 The different grammatical structures in which the two main senses are used
        §3.2.3 Word sense disambiguation result of “since”
    §3.3 Word sense disambiguation of “old”, “draw” and “sweet”
        §3.3.1 Difference between the word sense disambiguation of “old” and “bank”
        §3.3.2 Multifarious word sense disambiguation methods required in disambiguating the forty five senses of “draw”
        §3.3.3 The ability of understanding required in disambiguating polysemous words in complicated situations
        §3.3.4 Significance of probability in word sense disambiguation
        §3.3.5 Application of componential analysis theory in compiling a machine readable dictionary
        §3.3.6 Utilization of WordNet in the compilation of a machine readable dictionary
        §3.3.7 Pragmatics and word sense disambiguation
        §3.3.8 Example-based and statistics-based approaches in machine translation
Chapter Four: Conclusion
Appendix I: WSD Machine.cpp
Appendix II: Co-occurrence Frequencies Counter.cpp
Appendix III: WSD Machine Sense Marker.cpp
Appendix IV: WSD Result of “bank” (100 examples)
Appendix V: WSD Machine Installation Package
Bibliography
Papers Published, Research Projects Undertaken and Results Achieved during the Period of Study
Acknowledgements

Chapter One: Introduction

In machine translation (MT), the term “ambiguous” usually means “polysemous”. Words or sentences that are called ambiguous in MT are not necessarily ambiguous in a given context. For example, the word “bachelor” in the sentence “He got his bachelor degree last year” is not ambiguous, because it can have only one meaning in this context; it is nevertheless called ambiguous in MT. When a word or a sentence is called ambiguous in MT, it means that there is more than one meaning for the computer to choose from. Word sense disambiguation (WSD) in MT is therefore the process in which the computer chooses, from the several possible senses of a word, the one appropriate to the context. WSD in MT does not mean that a computer can disambiguate a word that is genuinely ambiguous in a sentence. Without context, even English speakers cannot decide the meaning of “bachelor” in the sentence “he is a bachelor”, let alone a computer.

The quality of MT hinges on WSD to a large extent. So far, no translation system in the world can disambiguate successfully in all kinds of domains and contexts. More often than not, a translation system is blamed for choosing the wrong sense rather than for wrong word order or awkward wording. Readers can put up with wrong word order or awkward wording in a translation to some extent, but they cannot accept a system that frequently chooses the wrong senses for polysemous words. Even a single WSD mistake may make a translation unreadable, because the wrong sense may destroy the coherence and unity of the whole passage.

WSD is the most difficult issue in MT. More than 90% of the frequently used words in English have more than one meaning. With the help of contextual information and real-world knowledge, people can easily choose the proper meaning of a polysemous word. This is very difficult for a computer, however, because every language is complicated and real-world knowledge is too vast and too multifarious for a computer to process. Furthermore, a computer lacks the ability to reason. WSD problems that involve world knowledge and reasoning therefore remain unsolved. For most polysemous words, however, it is possible to improve WSD efficiency.


This paper studies WSD in MT with the aim of finding ways to improve WSD efficiency. The current WSD methods are examined, and a pilot study of WSD is carried out by the author. In this study, thousands of sentences containing the polysemous words “bank”, “old”, “draw”, “sweet” and “since” were downloaded from the British National Corpus (BNC) and analyzed with programs written in C++ by the author. These five words cover most of the word classes of polysemous words, i.e. noun, verb, adjective, preposition and conjunction. The WSD methods for these five words are typical, as will be illustrated later. The well-known WordNet system is also used to study the WSD methods based on co-occurrence words and/or selectional restrictions. After drawing conclusions, the author designed two WSD systems, “WSD Machine” and “WSD Machine Sense Marker”, to test them. All the programs used in this pilot study were written in C++ and compiled with Microsoft Visual C++ 6.0, and the important source files are attached in the appendices.
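To make the idea of analyzing corpus sentences for co-occurrence words concrete, the following is a minimal sketch in C++. It is not the author's program from the appendices; the function names `normalize` and `countCooccurrences`, the whitespace tokenization and the fixed window size are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Lower-case a token and strip leading/trailing non-letters ("Bank," -> "bank").
// Hypothetical helper; real corpus text would need more careful tokenization.
std::string normalize(const std::string& token) {
    std::size_t begin = 0, end = token.size();
    while (begin < end && !std::isalpha(static_cast<unsigned char>(token[begin]))) ++begin;
    while (end > begin && !std::isalpha(static_cast<unsigned char>(token[end - 1]))) --end;
    std::string word = token.substr(begin, end - begin);
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Count every word that appears within `window` tokens of `target`
// in any of the given sentences.
std::map<std::string, int> countCooccurrences(const std::vector<std::string>& sentences,
                                              const std::string& target,
                                              std::size_t window) {
    std::map<std::string, int> counts;
    for (const std::string& sentence : sentences) {
        // Split the sentence on whitespace and normalize each token.
        std::vector<std::string> words;
        std::istringstream stream(sentence);
        std::string token;
        while (stream >> token) {
            std::string word = normalize(token);
            if (!word.empty()) words.push_back(word);
        }
        // For each occurrence of the target, count its neighbours.
        for (std::size_t i = 0; i < words.size(); ++i) {
            if (words[i] != target) continue;
            std::size_t lo = (i > window) ? i - window : 0;
            std::size_t hi = std::min(words.size(), i + window + 1);
            for (std::size_t j = lo; j < hi; ++j) {
                if (j != i) ++counts[words[j]];
            }
        }
    }
    return counts;
}
```

Calling `countCooccurrences(sentences, "bank", 2)` on a list of sentences returns a map from each neighbouring word to its frequency. In practice, function words such as “the” would dominate the counts and would have to be filtered out before the counts are of any use for WSD.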

Chapter Two: Current Word Sense Disambiguation Methods

Linguists began to study WSD in the 1950s. More than forty years later, WSD still remains unsolved. Over the past few decades, several WSD methods have been proposed and tested (Feng 573-595). If used properly, these methods can be very efficient, but for various reasons none of them has achieved high efficiency on a large scale. The methods are explained below.

The first method is to choose the most frequently used sense. The efficiency of this method is very low. Nevertheless, most translation systems adopt it because it is simple and, for the moment, no other universally applicable method is available. One English-Chinese translation system that adopts this method mistranslates “he walked to the bank to see the current” as “他走路去银行见到涌流” (i.e., he walked to the financial institution to see the flow).
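The most-frequent-sense method can be sketched in a few lines of C++. The struct `SenseEntry`, the function `mostFrequentSense` and the frequency figures used in the usage note are hypothetical and serve only to show why the method ignores context.

```cpp
#include <string>
#include <vector>

// One candidate sense of a polysemous word (illustrative structure).
struct SenseEntry {
    std::string translation;  // target-language rendering of the sense
    int frequency;            // how often the sense was observed (made-up figures in the example)
};

// Return the translation of the most frequent sense, ignoring the context entirely.
// Returns an empty string when no senses are given.
std::string mostFrequentSense(const std::vector<SenseEntry>& senses) {
    std::string best;
    int bestFrequency = -1;
    for (const SenseEntry& sense : senses) {
        if (sense.frequency > bestFrequency) {
            bestFrequency = sense.frequency;
            best = sense.translation;
        }
    }
    return best;
}
```

With the invented entries `{"银行", 900}` (financial institution) and `{"河岸", 100}` (riverside), the function always returns “银行”, whatever the sentence says. This is exactly the behaviour that produces the mistranslation quoted above.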

The second method is to use part-of-speech information in WSD. For example, the first “can” in the sentence “we can what we can eat” must be a transitive verb. It should therefore be translated as “把…装罐” (preserve in a can or tin) rather than “能够” (be able to). This method alone cannot disambiguate effectively when a word has several senses in the same word class. For example, it cannot decide whether “the man is rather old” should be translated as “这个人很旧了” (i.e. the man is not new) or as “这个人很老了” (i.e. the man is not young), because both “老的” (not young) and “旧的” (not new) are adjectives. The method can nevertheless narrow the choices down to one word class, and thus can serve as an auxiliary method alongside others. Many translation systems have incorporated it in their parsers.
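The filtering role of part-of-speech information can be illustrated with a short C++ sketch. It assumes that a parser has already tagged the word; the names `PartOfSpeech`, `TaggedSense` and `filterByPartOfSpeech` are hypothetical.

```cpp
#include <string>
#include <vector>

// Coarse word classes, sufficient for the two examples in the text.
enum class PartOfSpeech { ModalVerb, TransitiveVerb, Adjective };

// A dictionary sense labelled with its word class (illustrative structure).
struct TaggedSense {
    PartOfSpeech pos;
    std::string translation;
};

// Keep only the senses whose word class matches the tag assigned by the parser.
// If more than one sense survives, another WSD method must decide among them.
std::vector<TaggedSense> filterByPartOfSpeech(const std::vector<TaggedSense>& senses,
                                              PartOfSpeech tagged) {
    std::vector<TaggedSense> kept;
    for (const TaggedSense& sense : senses) {
        if (sense.pos == tagged) kept.push_back(sense);
    }
    return kept;
}
```

For “can” tagged as a transitive verb, the filter leaves a single sense, “把…装罐”. For “old”, both “老的” and “旧的” are adjectives, so both survive and the ambiguity remains, which is why the method can only be auxiliary.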

The third method is to use selectional restrictions and type-hierarchy information in WSD. In the sentence above, for example, the sense “老的” (not young) usually requires its subject to be a living thing, while “旧的” (not new) requires that its subject not be a living thing.
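A selectional restriction of this kind can be sketched as a simple rule in C++. The tiny list of animate nouns stands in for a real type hierarchy such as the one a lexical database would provide; `isAnimate` and `translateOld` are hypothetical names.

```cpp
#include <set>
#include <string>

// Toy stand-in for a type hierarchy: a hand-written set of nouns denoting living things.
bool isAnimate(const std::string& noun) {
    static const std::set<std::string> animateNouns = {"man", "woman", "dog", "teacher"};
    return animateNouns.count(noun) > 0;
}

// Choose the sense of "old" from the animacy of its subject:
// living subject -> "老的" (not young); otherwise -> "旧的" (not new).
std::string translateOld(const std::string& subjectNoun) {
    return isAnimate(subjectNoun) ? "老的" : "旧的";
}
```

Here `translateOld("man")` yields “老的” and `translateOld("house")` yields “旧的”. Note that any noun missing from the toy list is treated as inanimate, so the quality of the result depends entirely on the coverage of the underlying hierarchy.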

A great deal of research has been carried out in this area. The most significant achievement is WordNet, developed by the Princeton University Cognitive Science Lab. Although WordNet was not designed specifically for WSD, it provides type-hierarchy information about words that can be used in WSD. WordNet is very useful in WSD, and advanced programmers can import its data directly into their own systems. More and more scholars are now studying WordNet, and some interface software has been developed.

In China, a bilingual system has also been developed: HowNet①, designed by Dong Zhendong, a distinguished professor at the Chinese Academy of Sciences, who regards HowNet as a Chinese-English common-sense knowledge system. Huang Zengyang designed HNC② (Hierarchical Network of Concepts). All these systems have made great contributions to WSD. However, language is living and creative, and it breaks conventions and restrictions whenever needed. For example, in the sentence “the bank bought all the stocks from that company”, the word “bank” does not play the role “location” that HowNet specifies. The following example departs even further from the restriction: in the sentence “it is reported that the old woman can eat stones”, the object of “eat” would normally be expected to belong to the edible class, yet “stones” does not. Although these sentences break the restrictions, they are completely acceptable, and sentences of this kind are frequently used in communication. WSD systems should therefore adapt to this flexibility of language, and WSD methods based on selectional restrictions and type-hierarchy information should not be applied in an absolute way (Feng 580). In Optimality Theory, René Kager makes the following statement (Kager 3):

Constraints are violable. Violation of a constraint is not a direct cause of ungrammaticality, nor is absolute satisfaction of all constraints essential to the least costly violation of the constraints. Constraints are intrinsically in CONFLICT, hence every logically possible output of any grammar will necessarily violate at least some constraint. Grammars must be able to regulate conflicts between universal constraints, in order to select the “most harmonic” or “optimal” output form. This conflict-regulating mechanism consists of a RANKING of universal constraints. Languages basically differ in their ranking of constraints. Each violation of a constraint is avoided; yet the violation of higher-ranked constraints is avoided “more forcefully” than the violation of lower-ranked constraints.
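One possible way to carry this idea over to WSD is to treat restrictions as ranked, violable constraints instead of absolute filters. The C++ sketch below is only an illustration of ranked-constraint selection under that assumption; the names `Candidate` and `selectOptimal`, and the choice of constraints in the usage note, are hypothetical and are not taken from Kager or from the systems discussed above.

```cpp
#include <string>
#include <vector>

// A candidate reading together with its violation profile.
// violations[0] belongs to the highest-ranked constraint, violations[1] to the next, and so on.
struct Candidate {
    std::string label;
    std::vector<int> violations;
};

// Return the candidate whose violation profile is lexicographically smallest:
// a violation of a higher-ranked constraint always outweighs any number of
// violations of lower-ranked ones. Returns nullptr for an empty list.
const Candidate* selectOptimal(const std::vector<Candidate>& candidates) {
    const Candidate* best = nullptr;
    for (const Candidate& candidate : candidates) {
        // std::vector's operator< compares element by element, i.e. lexicographically.
        if (best == nullptr || candidate.violations < best->violations) {
            best = &candidate;
        }
    }
    return best;
}
```

Suppose, for “eat” in “the old woman can eat stones”, that a part-of-speech constraint is ranked above the “object must be edible” restriction. The verb reading “吃” then has the profile `{0, 1}`, while a hypothetical noun reading has `{1, 0}`, so “吃” is still selected even though it violates the edibility restriction. The restriction lowers a candidate's standing without ruling it out.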

①This software is free and can be downloaded from http://wordnet.princeton.edu/obtain.

②For detailed reference, please visit http://www.hncnlp.com/.
