Word Sense Disambiguation in Machine Translation

Author: Niu Yue · Date: 2024-11-19 · 106 pages
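Abstract (Chinese version)
Machine translation is one of the ten great challenges of contemporary science and
technology, and word sense disambiguation is the hardest problem in machine translation.
Unless it is solved, the quality of machine-translated text cannot improve substantially.
This thesis first evaluates several word sense disambiguation methods in common use and
then presents a disambiguation experiment designed by the author. In the experiment, the
author selected 5260 common words from the vocabulary prescribed by the new syllabus for
College English Test Bands 4 and 6 and compiled them into a machine-readable dictionary,
chose five representative polysemous words (“bank”, “old”, “draw”, “sweet” and “since”)
as the objects of the experiment, extracted about thirty thousand sentences containing
these five words from the British National Corpus, and analyzed this material with
purpose-written programs. Finally, the author designed a “word sense disambiguation
system” and a “word sense tagging system” in C++ to test the hypotheses proposed here.
The experiment shows that the way senses vary differs from one polysemous word to
another, so no universal rule can be induced. Some semantic variation is regular, but for
most polysemous words it follows no general rule, and existing methods fail to solve word
sense disambiguation precisely because they ignore this fact. The specific rules
governing each polysemous word must therefore be worked out first, and suitable
disambiguation methods chosen on that basis to improve efficiency. Because no method is
universal and every method has shortcomings, a disambiguation system must be able to
choose a suitable method flexibly according to the contextual information available. The
hardest task in word sense disambiguation is not to invent an entirely new method but to
work out the rules governing the senses of each polysemous word. However novel a method
is, it cannot work without data to disambiguate with. Efficiency depends on the quality
of that data, and high-quality data cannot be obtained without first working out the
specific rules of each polysemous word.
Although general rules are held to be of practical value, in word sense disambiguation
the specific rules of individual polysemous words matter more than general semantic
rules. Because these rules are too complex and variable to be generalized, we can only
work them out word by word and solve the disambiguation problem one polysemous word at a
time, until most common polysemous words are handled and the quality of machine
translation is satisfactory. Designing a disambiguation system is thus an arduous task.
Disambiguating even one sense of one polysemous word can be laborious and
time-consuming, let alone tens of thousands of senses. Disambiguation is therefore far
from easy, and a universally applicable method cannot be found. This does not mean that
the problem is insoluble, only that it demands long and sustained effort. That effort
will be richly rewarded, because a successful machine translation system is bound to
bring great commercial profit.
Key words: word sense disambiguation, machine translation, co-occurrence words, word
sense disambiguation system, word sense tagging system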
ABSTRACT
Machine translation (MT for short) is considered one of the ten most difficult
problems in science and technology (Feng 2), and word sense disambiguation
(WSD for short) is the most difficult issue in MT. Unless it is solved, MT can
hardly make any substantial progress.
This paper first comments on the current WSD methods and then introduces the pilot
study carried out by the author. In this study, 5260 frequently used words chosen from
the CET 4 and CET 6 (College English Test) vocabularies were compiled into a
machine-readable dictionary. About thirty thousand sentences containing the polysemous
words “bank”, “old”, “draw”, “sweet” and “since” were downloaded from the British
National Corpus and analyzed with programs written in C++. The author then designed two
WSD systems, “WSD Machine” and “WSD Machine Sense Marker”, to test the hypothesis
proposed in this paper.
This pilot study shows that the specific semantic rules vary from sense to sense and
from word to word, and thus cannot be generalized. Most commonly used methods cannot
solve the WSD issue in MT because they neglect the fact that, while some semantic rules
can be generalized, many others are too specific to be generalized, especially the
specific semantic rules of each polysemous word. These rules must therefore be worked
out first, before appropriate methods are applied to improve the efficiency of WSD.
Since every WSD method has disadvantages and none is applicable to all kinds of
ambiguity, a well-designed WSD system should be able to choose appropriate WSD methods
according to the WSD information available in the context. The most difficult task in
WSD is not to devise an entirely new WSD method but to work out the specific semantic
rules of the senses of each polysemous word. However innovative a method is, it cannot
work without WSD data. WSD efficiency hinges on the quality of the data, and
high-quality data cannot be collected before the specific semantic rules of the senses
of each polysemous word have been worked out.
Although it is widely accepted that rules should be generalized for practical
purposes, in WSD the specific semantic rules of polysemous words turn out to be much
more important than the general rules. Since the semantic rules of polysemous words are
too varied to be generalized, we should work out the specific semantic rules of the
senses of each polysemous word and then solve the WSD issue word by word, until most of
the frequently used polysemous words are well disambiguated and the quality of MT is
satisfactory. Designing a WSD system therefore demands painstaking effort. Even the
disambiguation of one sense of a polysemous word may be arduous and time-consuming, let
alone thousands of polysemous words with thousands upon thousands of senses. WSD is no
easy task, and to attempt to find a universally applicable approach is to invite
frustration. Nonetheless, WSD is not a matter of impossibility but a matter of time and
effort. The effort invested in WSD will be well rewarded, because a successful MT system
is bound to be very profitable.
Note: “bank”, “old”, “draw”, “sweet” and “since” cover most of the word classes of
polysemous words, i.e. noun, verb, adjective, preposition and conjunction. The WSD
methods of these five words are very typical, which will be illustrated later.
Key Words: word sense disambiguation, machine translation,
co-occurrence word, WSD Machine, WSD Machine Sense Marker
Contents
Abstract (Chinese version)
ABSTRACT
Chapter One: Introduction
Chapter Two: Current Word Sense Disambiguation Methods
Chapter Three: A Pilot Study of Word Sense Disambiguation
§3.1 Adaptation of the word sense disambiguation method based on co-occurrence features
§3.1.1 The co-occurrence words of “bank”
§3.1.2 Limitation of the method based on co-occurrence words
§3.1.3 Other co-occurrence features
§3.1.4 Priority of monosemantic expressions in machine translation
§3.1.5 Compilation of a machine readable dictionary based on co-occurrence features
§3.1.6 “WSD Machine Sense Marker” and the word sense disambiguation result
§3.2 The key role of grammatical structures in disambiguating the senses of “since”
§3.2.1 Incompetence of the current word sense disambiguation methods in disambiguating the senses of “since”
§3.2.2 The different grammatical structures in which the two main senses are used
§3.2.3 Word sense disambiguation result of “since”
§3.3 Word sense disambiguation of “old”, “draw” and “sweet”
§3.3.1 Difference between the word sense disambiguation of “old” and “bank”
§3.3.2 Multifarious word sense disambiguation methods required in disambiguating the forty five senses of “draw”
§3.3.3 The ability of understanding required in disambiguating polysemous words in complicated situations
§3.3.4 Significance of probability in word sense disambiguation
§3.3.5 Application of componential analysis theory in compiling a machine readable dictionary
§3.3.6 Utilization of WordNet in the compilation of a machine readable dictionary
§3.3.7 Pragmatics and word sense disambiguation
§3.3.8 Example-based and statistics-based approaches in machine translation
Chapter Four: Conclusion
Appendix I: WSD Machine.cpp
Appendix II: Co-occurrence Frequencies Counter.cpp
Appendix III: WSD Machine Sense Marker.cpp
Appendix IV: WSD Result of “bank” (100 examples)
Appendix V: WSD Machine Installation Package
Bibliography
Papers Published, Research Projects Undertaken and Results Achieved During the Period of Study
Acknowledgements
Chapter One: Introduction
In machine translation (MT for short), the term “ambiguous” usually means
“polysemous”. Words or sentences called ambiguous in MT are often not ambiguous at all
in a given context. For example, the word “bachelor” in the sentence “He got his bachelor
degree last year” can have only one meaning in this context, yet it is still called
ambiguous in MT. Calling a word or a sentence ambiguous in MT means only that the
computer has more than one meaning to choose from. Word sense disambiguation (WSD for
short) in MT is therefore the process in which the computer chooses, from the several
possible senses of a word, the one appropriate to the context. It does not mean that the
computer can resolve a genuinely ambiguous word. Without further context, even English
speakers cannot decide the meaning of “bachelor” in the sentence “he is a bachelor”,
let alone a computer.
The quality of MT hinges on WSD to a large extent. So far, no translation system in
the world disambiguates successfully across all domains and contexts. More often than
not, a translation system is blamed for choosing the wrong sense rather than for wrong
word order or awkward wording. Readers can put up with wrong word order or awkward
wording to some extent, but they cannot accept a translation that frequently chooses
the wrong senses for polysemous words. Even a single mistake in WSD may make the
translation unreadable, because the wrong sense can destroy the coherence and unity of
the whole passage.
WSD is the most difficult issue in MT. More than 90% of the frequently used
words in English have more than one meaning. With the help of contextual information
and real-world knowledge, people can easily choose the proper meaning of a
polysemous word. A computer finds this very difficult, because every language is
complicated and real-world knowledge is too vast and too varied for a computer to
process. Furthermore, a computer lacks the ability to reason. WSD problems that
involve world knowledge and reasoning therefore remain unsolved. For most polysemous
words, however, it is possible to improve WSD efficiency.
This paper studies WSD in MT with the aim of finding ways to improve WSD efficiency.
It reviews the current WSD methods and reports a pilot study carried out by the author.
In this study, thousands of sentences containing the polysemous words “bank”, “old”,
“draw”, “sweet” and “since” were downloaded from the British National Corpus (BNC for
short) and analyzed with programs written in C++ by the author. These five words
cover most of the word classes of polysemous words, i.e. noun, verb, adjective,
preposition and conjunction, and the WSD methods they require are typical, as will be
illustrated later. The well-known WordNet system is also used to study the WSD methods
based on co-occurrence words and/or selectional restriction. After drawing conclusions,
the author designed two WSD systems, “WSD Machine” and “WSD Machine Sense Marker”, to
test them. All the programs used in this pilot study are written in C++ with the
compiler Microsoft Visual C++ 6.0, and the important source files are attached in the
appendices.
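The kind of corpus analysis described above can be illustrated with a minimal sketch
that counts the words co-occurring with a target word. It is not the program attached in
the appendices: the function names normalize and countCooccurrences are hypothetical,
the tokenization is deliberately naive, and the code assumes a C++11 compiler rather
than Microsoft Visual C++ 6.0.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Lower-case a token and strip non-letter characters from both ends
// (hypothetical helper; real corpus text needs more careful tokenization).
std::string normalize(const std::string& token) {
    std::size_t begin = 0, end = token.size();
    while (begin < end && !std::isalpha(static_cast<unsigned char>(token[begin]))) ++begin;
    while (end > begin && !std::isalpha(static_cast<unsigned char>(token[end - 1]))) --end;
    std::string word = token.substr(begin, end - begin);
    for (char& c : word) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return word;
}

// Count how often each word appears in sentences that contain `target`.
// Sentences without the target word are skipped; the target itself is not counted.
std::map<std::string, int> countCooccurrences(const std::vector<std::string>& sentences,
                                              const std::string& target) {
    std::map<std::string, int> counts;
    for (const std::string& sentence : sentences) {
        std::istringstream stream(sentence);
        std::vector<std::string> words;
        std::string token;
        bool containsTarget = false;
        while (stream >> token) {
            const std::string word = normalize(token);
            if (word.empty()) continue;
            if (word == target) containsTarget = true;
            else words.push_back(word);
        }
        if (!containsTarget) continue;
        for (const std::string& word : words) ++counts[word];
    }
    return counts;
}
```

Calling countCooccurrences with a list of sentences and the target "bank" returns a
map from each co-occurring word to its frequency, which is the raw material for a
dictionary based on co-occurrence features.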
Chapter Two: Current Word Sense Disambiguation Methods
Linguists began to study WSD in the 1950s. More than forty years later, WSD
remains unsolved. During the past few decades, several WSD methods have been
suggested and tested (Feng 573-595). Used properly, these methods can be very
efficient, but none of them has achieved high efficiency on a large scale. They are
explained in turn below.
The first method is to choose the most frequently used sense. The efficiency of
this method is very low. However, most translation systems choose it because it is
simple and, for the moment, no other universally applicable method is available. One
English-Chinese translation system adopting this method mistranslates “he walked to
the bank to see the current” into “[Chinese text missing]” (i.e., he walked to the
financial institution to see the flow).
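A minimal sketch of this first method follows. The Sense structure, the function name
mostFrequentSense and the frequency figures in the usage note are hypothetical and serve
only to show why the method ignores context.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One dictionary sense with an observed frequency (illustrative values only).
struct Sense {
    std::string gloss;
    int frequency;
};

// Return the index of the most frequent sense, or -1 if the word has no senses.
// The surrounding sentence is never consulted, which is the weakness of the method.
int mostFrequentSense(const std::vector<Sense>& senses) {
    int best = -1;
    for (std::size_t i = 0; i < senses.size(); ++i) {
        if (best < 0 || senses[i].frequency > senses[best].frequency) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

If “bank” is listed with the senses “financial institution” (frequency 80) and “side
of a river” (frequency 20), the function returns the first sense for every sentence,
including the one about the current.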
The second method is to use part-of-speech information in WSD. For example,
the first “can” in the sentence “we can what we can eat” must be a transitive verb.
Therefore, it should be translated as “装罐” (preserve in a can or tin) rather than
“[Chinese text missing]” (be able to). This method alone cannot disambiguate word senses
effectively when several senses belong to the same word class, e.g. it cannot decide
whether “the man is rather old” should be translated as “这个人很旧了” (i.e. the man is
not new) or as “这个人老了” (i.e. the man is not young), because both “老的” (not young)
and “旧的” (not new) are adjectives. The method can narrow down the choices to one word
class and can thus be used as an auxiliary method together with other methods. Many
translation systems have incorporated it in their parsers.
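The filtering step of the second method can be sketched as follows. The types
PartOfSpeech and TaggedSense and the function filterByPartOfSpeech are hypothetical, and
the sketch assumes that a parser has already assigned a part of speech to the word.

```cpp
#include <string>
#include <vector>

enum class PartOfSpeech { Noun, Verb, Adjective, Modal };

// A dictionary sense labeled with its word class.
struct TaggedSense {
    PartOfSpeech pos;
    std::string gloss;
};

// Keep only the senses whose word class matches the tag supplied by the parser.
std::vector<TaggedSense> filterByPartOfSpeech(const std::vector<TaggedSense>& senses,
                                              PartOfSpeech tag) {
    std::vector<TaggedSense> kept;
    for (const TaggedSense& sense : senses) {
        if (sense.pos == tag) kept.push_back(sense);
    }
    return kept;
}
```

For “can” tagged as a verb, only “preserve in a can or tin” survives the filter. For
“old”, both “not young” and “not new” are adjectives, so two candidates remain and a
further method is needed.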
The third method is to use selectional restrictions and type-hierarchy information
in WSD. In the sentence above, for example, the sense “老的” (not young) usually
requires its subject to be a living thing, while “旧的” (not new) requires that the
subject not be a living thing.
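The restriction on the subject of “old” can be sketched with a toy type hierarchy.
The hierarchy entries and the names hierarchy, isA and senseOfOld are hypothetical; a
real system would draw the hierarchy from a resource such as WordNet.

```cpp
#include <map>
#include <string>

// Toy type hierarchy: each word points to its parent class (hypothetical entries).
const std::map<std::string, std::string>& hierarchy() {
    static const std::map<std::string, std::string> parents = {
        {"man", "person"},   {"person", "living thing"}, {"dog", "animal"},
        {"animal", "living thing"}, {"book", "artifact"}, {"artifact", "object"}};
    return parents;
}

// Walk up the hierarchy to test whether `word` belongs to the class `ancestor`.
bool isA(std::string word, const std::string& ancestor) {
    const std::map<std::string, std::string>& parents = hierarchy();
    while (true) {
        if (word == ancestor) return true;
        const auto it = parents.find(word);
        if (it == parents.end()) return false;
        word = it->second;
    }
}

// Choose the sense of "old" from the class of its subject.
// Subjects missing from the hierarchy fall through to "not new".
std::string senseOfOld(const std::string& subject) {
    return isA(subject, "living thing") ? "not young" : "not new";
}
```

With these entries, senseOfOld("man") yields “not young” and senseOfOld("book")
yields “not new”. The fallback for unknown subjects shows how brittle a hard
restriction is when the hierarchy is incomplete.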
A lot of research has been carried out in this area. The most significant
achievement is WordNet, developed by the Cognitive Science Laboratory at Princeton
University. Although WordNet is not specially designed for WSD, it provides
type-hierarchy information about words, which can be used in WSD. WordNet is very
useful in WSD, and advanced programmers can import its data directly into their own
systems. More and more scholars have begun to study WordNet, and some interface
software has been developed for it.
In China, a bilingual system has also been developed: HowNet, designed by Dong
Zhendong, a distinguished professor at the Chinese Academy of Sciences, who describes
HowNet as a Chinese-English common-sense knowledge system. Huang Zengyang designed HNC
(Hierarchical Network of Concepts). All these systems have contributed greatly to WSD.
However, language is living and creative, and it breaks conventions and restrictions
whenever it needs to. For example, in the sentence “the bank bought all the stocks from
that company”, the word “bank” does not play the role “location” that HowNet
specifies. The following example departs even further from the restriction: in the
sentence “it is reported that the old woman can eat stones”, the object of “eat” would
normally be expected to belong to the class of edible things. Although these sentences
break the restrictions, they are completely acceptable, and sentences of this kind are
frequent in communication. WSD systems should therefore adapt to this flexibility of
language, and the WSD method based on selectional restrictions and type-hierarchy
information should not be applied in an absolute way (Feng 580). In Optimality Theory,
René Kager makes the following statement (Kager 3):
Constraints are violable. Violation of a constraint is not a direct cause of
ungrammaticality, nor is absolute satisfaction of all constraints essential to the
grammar’s outputs. Instead what determines the best output of a grammar is the
least costly violation of the constraints. Constraints are intrinsically in
CONFLICT, hence every logically possible output of any grammar will
necessarily violate at least some constraint. Grammars must be able to regulate
conflicts between universal constraints, in order to select the “most harmonic”
or “optimal” output form. This conflict-regulating mechanism consists of a
RANKING of universal constraints. Languages basically differ in their ranking of
constraints. Each violation of a constraint is avoided; yet the violation of
higher-ranked constraints is avoided “more forcefully” than the violation of
lower-ranked constraints.
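Applied to WSD, the idea of ranked, violable constraints can be sketched as follows.
Mapping candidate senses onto Optimality Theory candidates is an assumption of this
sketch, not a procedure given by Kager or by HowNet, and the names Constraint and
selectOptimal are hypothetical. Each constraint returns a violation count, and the
candidate whose violations are least costly under the ranking is selected.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A constraint maps a candidate sense to its number of violations.
using Constraint = std::function<int(const std::string&)>;

// `ranked` lists constraints from highest to lowest rank. Violation profiles are
// compared lexicographically, so one violation of a higher-ranked constraint
// outweighs any number of violations of lower-ranked ones.
// Returns the index of the optimal candidate, or -1 if there are no candidates.
int selectOptimal(const std::vector<std::string>& candidates,
                  const std::vector<Constraint>& ranked) {
    int best = -1;
    std::vector<int> bestProfile;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::vector<int> profile;
        for (const Constraint& constraint : ranked) {
            profile.push_back(constraint(candidates[i]));
        }
        if (best < 0 || profile < bestProfile) {
            best = static_cast<int>(i);
            bestProfile = profile;
        }
    }
    return best;
}
```

For “the bank bought all the stocks from that company”, suppose one constraint
penalizes a subject of “bought” that cannot act and a lower-ranked one penalizes
departing from the default “location” role. Every candidate violates something, yet
“financial institution” is selected because it violates only the lower-ranked
constraint. Reversing the ranking selects “side of a river” instead.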
Note: WordNet is free and can be downloaded from http://wordnet.princeton.edu/obtain.
Note: For details on HNC, please visit http://www.hncnlp.com/.