
An Operator Based System for Natural Language Analysis

Yusuf Altunel
Public Communications Systems, Siemens A.S., Ankara, Turkey 06660
{altunel@erkin.ceng.metu.edu.tr}

Mehmet R. Tolun
Department of Computer Engineering, Middle East Technical University, Inonu Bulvari, Ankara, Turkey 06531
{[email protected]}

Abstract

In this paper we present a formalism for an operator-based system and its application to natural language analysis. The approach uses no feature structures and makes no direct reference to them; hence the formalism, as stated, is unification-free. One of its major strengths is the ability to express the phonological, morphological and syntactic rules of a natural language, which makes the approach easily adaptable and language independent. The formalism is implemented in PCSCHEME, a dialect of the Lisp programming language, and is based on top-down, depth-first search with an exhaustive evaluation property that returns all possible results.

1. Introduction

The term formalism is taken here to mean the systematic tool used to represent information about languages in a well-defined system, together with the methodology for collecting this information from the definitions. In this context, the formalism has two parts: the representational side, whose aim is to define a framework for expressing the rules of a Natural Language (NL), and the operational side, which is the set of evaluations defining how these rules are used to retrieve information.

Separating the representational side from the operational side yields a certain amount of flexibility (Byrd et al. 1987: 7). Since the representation is treated as completely separate from the implementation, the formalism itself does not depend on data structures (i.e., complex or feature categories) or on transition networks. This property makes it possible to use the formalism without any knowledge of such data structures.

The basic properties that any formalism is expected to have are defined by Shieber (1988) as follows:

In sum, the characteristics of grammar formalisms promoted by the goals of NLP are, first of all, weak completeness and computational effectiveness. Secondarily, they are the goals of computer language design in general: expressivity, simplicity, declarativeness, rigor, etc.

Our formalism, henceforth called the Operation-Based Formalism (OBF), mainly has the latter attributes. Specifically, OBF defines natural language knowledge as operations on variables, reducing unification to its basic operations.

The main idea of the present formalism is to define an NL rule as a number of operations on variables. Each operation is well defined within the formalism, and the information gathering process is directed by these operations. Since this process is the successive calculation of variables, it is called evaluation.


Depending on the operations defined, the formalism can describe not only the syntax of an NL but also its morphology, phonology, and possibly its semantics. Such a complete approach is possible because any kind of operation on the language can be defined. OBF can therefore be used for both specific and general applications of Natural Language Processing (NLP).

OBF allows general rules of NLs to be defined easily. However, in some cases generality can result in inefficiency. Under certain conditions, simplifying some of the expressions in the rules helps to prevent unnecessary evaluations. The simplification algorithm in OBF also prevents infinite calculations.

OBF defines a set of expressions for stating NL rules. Some expressions can be reduced to equivalent but more efficient ones, and an optimisation algorithm could be implemented to increase efficiency.

Lexicon design is also very simple, since the present approach does not require the information needed to build f-structures for lexicon entries. With OBF, each category becomes a terminal entry, and words are collected into a category that is referenced within non-terminal rules.

The paper is organised as follows. Section 2 discusses one of the important properties of the system, its non-feature-based character, and provides examples from different natural languages. Section 3 describes the formalism in greater depth by introducing the terminology and the set and string operators. Sections 4 and 5 present the morphological and phonological processes that the formalism can support, with examples from various languages. Section 6 explains the implementation phases of the formalism. Finally, the Appendix gives the BNF definition of the non-terminal rules of the system.

2. The Non-Feature Based Approach

One of the basic properties of our system is its non-feature-based character. Shieber describes feature structures as “a set of graphs over a finite set of arc labels and a finite set of atomic values” (Shieber 1988: 7). Shieber also defines unification as the “combination of two sets of feature structures that involves taking the union of the feature/value pairs and, in case both sets have values for the same feature, combining these values recursively”. Feature structures (f-structures), or complex categories¹, and the formalisms based on them have achieved mainstream status in NLP applications. Shieber describes the importance of unification in the following paragraph:

Reliance on unification is in happy concurrence with linguistic practice, since unification is a primary operation in many current linguistic grammar formalisms; moreover, its typical applications pattern matching, equality testing, and feature passing are found in an even wider range of linguistic analyses. Unification can also be used to model analyses with many other combining operations and can sometimes even substitute for string operations other than concatenation. (Shieber 1988: 12).

F-structures define a set of atomic values or complex categories as features. Building unification over f-structures gives the power to express non-CFG (Context Free Grammar) rules for natural languages. In addition, unification is accepted as an effective tool in linguistics.

However, employing f-structures increases the complexity of the implementation algorithms. Another problem is the lack of declarativeness. Normally, a CFG rule must be defined for the syntactic rule, together with a set of rules specifying the agreement between the features. All features of a word in the lexicon must be specified, even though not every syntactic rule needs to reference them. Since words are declared with their features, designing the lexicon for a language becomes a tedious job. To illustrate the complication, consider an example (Boguraev 1988: 125) of a lexicon entry from a typical unification-based application.

1 (Reyle and Rohrer 1988: 4).

(believe Verb (Sense 3)
  ((Takes NP Sbar) (Type 2))
  ((Takes NP NP Inf) (Type 2 ORaising))
  ((or ((Takes NP NP NP) (Type 2 ORaising))
       ((Takes NP NP AuxInf) (Type 2 ORaising))))
  ((or ((Takes NP NP AP) (Type 2 ORaising))
       ((Takes NP NP AuxInf) (Type 2 ORaising)))))

Designing a lexicon in this way is unpleasant, especially when a real Natural Language Application (NLA) needs thousands of lexicon entries (see, for example, Byrd et al. 1987). Each lexical item must be analysed with respect to every atomic value that may be referenced by other unifications in the implementation.

Finally, Sijtsma (1994) states that unification is not sufficient to define all possible agreement rules. Consider the examples:

1) English: John (sg) and Mary (sg) walk. (Sijtsma 1994: 182)

2) Slovene: ta streha in gnezdo na njej mi bosta ostala spominu (Sijtsma 1994: 185)
that roof(fem) and the-nest(neut) on it to-me will remain(masc) in memory
‘I will remember that roof and the nest on it’

The complication in example (1) arises from singular-plural agreement in English. Normally, a singular subject is followed by a singular verb and a plural subject by a plural verb. However, the example has two singular subjects that are conjoined by ‘and’ and followed by a plural verb.

Example (2) shows the failure of simple gender agreement in Slovene. Two conjoined nouns, each with a different gender, are followed by a verb with yet another gender. Sijtsma calls example (2) a ‘hard case’ because

Whereas the conjuncts only contain feminine and neuter nouns, the entire coordinate clause triggers masculine agreement on the predicate. (Sijtsma 1994: 185).

Similar examples can be given from Turkish:

3a) Ben ve sen okula gidiyoruz.
I (1st person, sing) and you (2nd person, sing) school (dat) go (1st person, plur)
I and you (we) are going to the school.

4a) Ali ve sen okula gidiyorsunuz.
Ali (3rd person, sing) and you (2nd person, sing) school (dat) go (2nd person, plur)
Ali and you are going to the school.

5a) Mehmet ve Ali okula gidiyorlar.
Mehmet (3rd person, sing) and Ali (3rd person, sing) school (dat) go (3rd person, plur)
Mehmet and Ali are going to the school.

On the other hand the following examples are not grammatical:


3b) *Ben ve sen okula gidiyorsunuz.
I (1st person, sing) and you (2nd person, sing) school (dat) go (2nd person, plur)
I and you (we) are going to the school.

4b) *Ali ve sen okula gidiyorlar.
Ali (3rd person, sing) and you (2nd person, sing) school (dat) go (3rd person, plur)
Ali and you are going to the school.

5b) *Mehmet ve Ali okula gidiyoruz.
Mehmet (3rd person, sing) and Ali (3rd person, sing) school (dat) go (1st person, plur)
Mehmet and Ali are going to the school.

In Turkish, subjects and verbs normally agree in both person and number. Examples (3a), (4a), and (5a) complicate the simple agreement rules. Example (3a) follows from the rule that a subject containing at least one 1st person nominal conjoined with any other nominals must be followed by a 1st person plural verb. Therefore (3b) is ungrammatical, since a 1st person subject conjoined with another nominal is followed by a 2nd person verb.

(4a) results from another agreement rule: a 2nd person subject conjoined with nominals other than the 1st person must be followed by a 2nd person plural verb. This rule makes (4b) ungrammatical.

(5a) illustrates the last agreement rule: conjoined 3rd person subjects must be followed by a 3rd person plural verb. Hence (5b) is ungrammatical.

After examining the inadequacy of the unification operation, Sijtsma concludes that new operators must be defined to overcome the complication.

Instead of ‘unification’, in coordinate clauses we need an ‘addition’ operator to handle number agreement with and plus a ‘logical or’ operator for gender agreement with and and or and for number agreement with or. (Sijtsma 1994: 185).

Formalisms based on unification are not easily adapted to new conditions. This weak adaptability arises because no explicit operation other than unification itself is defined. However, NLP systems should be able to handle operations that are not reducible to unification.

An operator-based system (OBS) employs a number of precisely defined operations over the syntactic categories, so a general algorithm for expression evaluation is possible. Like arithmetic expressions, syntactic expressions can be evaluated. In this approach, problematic syntactic rules are handled by defining new operators whenever necessary. In the operator-based approach the lexicon entries are the features themselves, but the features now become variables. The problem is reduced to defining the lexical items that give rise to each feature. The feature structures defined in rule (i) become the lexical entries of rule (ii).

i) Maria=[cat{np}, agr{number{sing}, person{3}}]
   Hans=[cat{np}, agr{number{sing}, person{3}}]
   liebt=[cat{verb}, agr{number{sing}, person{3}}]

ii) sing=Maria ? Hans ? liebt.
    3rd person=Maria ? Hans ? liebt.
    np=Maria ? Hans.
    verb=liebt.

So the main difference between these rules is that the LHS of rule (i) becomes the RHS of rule (ii). Another difference is that ‘cat’ and ‘agr’ are not defined as lexical entries but within the syntactic rules, as in rule (iii).

iii) cat=np ? verb.
     agr=num ? gender.
     num=sing ? plural.

Hence the same information is represented in a simpler manner without increasing the number of entries. Rule (iv) gives the complete definition of “Hans liebt Maria”. The gender agreement rule is not included in this example. The only difference from a CFG approach is the application of the and operator in the first rule. The rule stating that a sentence is an NP followed by a VP is thus evaluated together with the agreement rule, and a string is grammatical if and only if both rules are satisfied at the same time. A combination of or and and operators simulates the traditional unification operator with a small modification of the features. So far, however, nothing has been said about how our approach handles the problematic data presented in the examples; the representation problem is to be distinguished from the evaluation problem.

iv) S=NP / VP & AGR.
    VP=V.
    AGR=NUM & GENDER.
    NUM=SINGULAR / SINGULAR ? PLURAL / PLURAL.
    NP=Maria ? Hans.
    V=liebt.
    SINGULAR=Maria ? Hans ? liebt.
    PLURAL=wir ? ihr ? sie.

Now let us represent the problematic English data within the new approach. First, however, we need to exemplify the normal agreement rules of English. Some of the possible rules are presented in rule (v)².

v) S=NP / VP & AGREEMENT.
   NP=N ? N / VP.
   VP=V ? N / VP.
   AGREEMENT=PLURAL / PLURAL ? SING / SING.

The rules in (v) are typical applications of the concatenation and and operators. The NP / VP part of the first rule defines the ordering information of the sentence, whereas the second part represents plural-singular agreement. The two rule types are combined with the and operator. When the evaluation algorithm is applied, it checks the two rules independently, i.e. the ordering information is examined without reference to agreement and vice versa.
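The independent checking of ordering and agreement can be sketched with ordinary sets. The Python fragment below is an illustration only, not the paper's PCSCHEME evaluator. The four-word lexicon, the simplification of NP to N and VP to V, and the space inserted between words are assumptions; only the shape of the rules comes from rule (v).

```python
from itertools import product

# Hypothetical terminal rules (word lists); not taken from the paper.
N = {"John", "dogs"}
V = {"walks", "walk"}
SING = {"John", "walks"}
PLURAL = {"dogs", "walk"}


def concat(left, right):
    """Concatenation ('/'): every left string followed by every right string."""
    return {f"{x} {y}" for x, y in product(left, right)}


# Ordering part of rule (v), simplified to S = N / V.
ORDER = concat(N, V)

# AGREEMENT = PLURAL / PLURAL ? SING / SING  ('?' is set union).
AGREEMENT = concat(PLURAL, PLURAL) | concat(SING, SING)

# S = ORDER & AGREEMENT  ('&' is set intersection): each side is computed
# without looking at the other, and only strings in both sets survive.
S = ORDER & AGREEMENT
```

Under these assumptions `S` contains exactly "John walks" and "dogs walk": strings such as "John walk" satisfy the ordering rule but are filtered out by the agreement rule.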

Let us develop rules for examples 3, 4, and 5 using the same approach. Example 3 is about the agreement in first person (singular-plural):

vi) AGREEMENT= 1st_PERSON / CONJ / NP / 1st_PERSON_PLURAL_VERB.

2 A full implementation of English would need more rules. Rules are simplified for representational considerations.


Rule (vi) states that if the subject includes any 1st person nominal conjoined with any other nominal, the verb must be a 1st person plural verb. The same methodology can be adapted to implement examples 4 and 5, as in rule (vii).

vii) AGREEMENT= 2nd_PERSON / CONJ / {SUBJECTS & 1st_PERSON :} / 2nd_PERSON_PLURAL_VERB ?
     3rd_PERSON / CONJ / {SUBJECTS & 1st_PERSON : & 2nd_PERSON :} / 3rd_PERSON_PLURAL_VERB.

The first part of rule (vii) excludes the 1st person case: its subject must not include any 1st person nominal. Similarly, the second part excludes both the 1st and 2nd person cases from the subjects. The exclusion is needed because rule (vii) applies only if rule (vi) does not, and the second part of rule (vii) applies only if the first part does not. This hierarchy among 1st, 2nd and 3rd person subjects is represented using a combination of the not and and operators.

3. Definition of the Formalism

The formalism is designed to define grammatical rules as expressions. No unification algorithm is explicitly defined in the present approach; unification can be simulated by a combination of operations on variables. No f-structures are employed within the formalism either.

The formalism expresses information about a language in the form of rules. When information about the language is needed, it is gathered by applying the evaluation algorithm to these rules. Rules comprise the static side of the formalism, whereas evaluations form the operational side, which is based on the operators. Each operator defines a well-defined operation to be executed when the rule is referenced. The present approach makes it possible to reduce an NL to rules without considering any data-structure details (such as complex structures) of the implementation. The evaluation of the rules is directed by the operators. As a result, all NLP operations are reduced to basic string and set operators and to a search problem.

3.1 Terminology

The rule system is the static part of the system. It is made up of terminal and non-terminal rules³. Both types of rule have a definer on the left-hand side (LHS) and a definition on the right-hand side (RHS), connected by an equals sign. The main difference between a Non-Terminal Rule (NTR) and a Terminal Rule (TR) is that the RHS of a TR contains only a list of strings, whereas the RHS of an NTR comprises statements. For the sake of simplicity, NTRs do not contain terminal strings and TRs cannot refer to any expression type.

The variable on the LHS is called the definer, and the whole RHS of the rule is the definition. The definition describes the conditions and operations that must be applied to evaluate the LHS. It may contain expressions with any number of variables and operators.

3.2 Operators

Operators define the relations between syntactic components. A syntactic component may be a simple variable or a statement consisting of a number of related variables. Operators are either unary, taking one operand, or binary, taking two. Statements, or expressions, are meaningful and grammatical constructions of symbols.

3 Under the current definition, the system does not allow terminal rules to be defined within non-terminal rules or vice versa. This approach improves the efficiency of evaluation. However, the restriction could be removed without losing any of the power of the formalism.


The operators return a result only if their condition is satisfied. This property brings the formalism close to the characteristics of a knowledge-based expert system. Requiring conditions to be satisfied reduces the number of evaluations and prevents infinite calculations.

vi) A=B ~ C.

Each rule has a built-in condition: the LHS is true only if the RHS can be satisfied. Each evaluation therefore checks whether the RHS of each successive rule is satisfied. At least one operand of every operator is a condition-operand, i.e. the operation succeeds only if its condition-operand is satisfied. In rule (vi), the first operand B of the CHANGE operator (~) is the condition-operand: the replacement takes place if and only if the first operand is satisfied, i.e. the string can be parsed by the first operand. Rule (vi) is interpreted as: A is valid if B can be replaced by C.

In general, three types of operators are defined in our system: set operators, string operators, and miscellaneous operators. The set operators are or, and, difference, and not. The string operators are concatenation, append, add-to-head, change, drop-head, and drop-tail. The miscellaneous operators are ghost, grouping and optional.

We introduce the following notation. Let (..) be a set and Õ an operator; let A, B and C be variables; let aaaa, bbbb, and cccc be strings; let a1..an, b1..bm, c1..ck be arbitrary strings, where k, m, and n are integers. A ⇒ (a1..an) means that variable A evaluates to a set whose only element is the string a1..an, and A ⇏ (b1..bm) means that A does not evaluate to (b1..bm). ∅ is the empty set, U is the universal set, and ε is the empty string.

3.3 Set operators

A set is the collection of solutions that results from evaluating a variable or definer. A definer may parse a string returning no, one, or several parse trees. Set operators define the union, intersection or difference of these evaluations. Some set operators enlarge the solution set whereas others shrink it at each step of the evaluation. For example, the or operator enlarges the solution set if its operands return disjoint sets of solutions, whereas the and operator shrinks it, since it returns the intersection of the possible solutions. The following sections present the set and string operators used in our implementation of the formalism.

AND Operator: “&”
Definition: AND is the intersection operation on sets of strings.
Example: If A ⇒ (aaaa), B ⇒ (aaaa) and C ⇒ (cccc) then A & B ⇒ (aaaa), whereas A & C ⇒ ∅.

OR Operator: “?”
Definition: OR is the union operation on sets of strings.
Example: If A ⇒ (aaaa, bbbb), B ⇒ (bbbb) and C ⇒ (cccc) then A ? B ⇒ (aaaa, bbbb), and A ? C ⇒ (aaaa, bbbb, cccc).

DIFFERENCE Operator: “-”
Definition: DIFFERENCE is the difference operation on sets of strings.
Example: If A ⇒ (aaaa, bbbb), B ⇒ (bbbb) and C ⇒ (cccc) then A - B ⇒ (aaaa), and B - A ⇒ ∅.

NOT Operator: “:”
Definition: NOT is the complementation operation on sets of strings.
Example: If A ⇒ (aaaa), U ⇒ (aaaa, bbbb, cccc) then A : ⇒ (bbbb, cccc).
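The four set operators map directly onto standard set operations. The sketch below restates the examples above in Python rather than in the PCSCHEME of the actual implementation; the function names are hypothetical, and the universal set is the one used in the NOT example.

```python
# Universal set U from the NOT example; an assumption for this sketch.
UNIVERSE = frozenset({"aaaa", "bbbb", "cccc"})


def op_and(a, b):
    """AND ('&'): intersection of two solution sets."""
    return set(a) & set(b)


def op_or(a, b):
    """OR ('?'): union of two solution sets."""
    return set(a) | set(b)


def op_difference(a, b):
    """DIFFERENCE ('-'): solutions of a that are not solutions of b."""
    return set(a) - set(b)


def op_not(a, universe=UNIVERSE):
    """NOT (':'): complement of a with respect to the universal set."""
    return set(universe) - set(a)
```

With A = {"aaaa", "bbbb"} and B = {"bbbb"}, `op_difference(A, B)` leaves only "aaaa", while `op_difference(B, A)` is empty, matching the DIFFERENCE example.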


3.4 String Operators

The operators in this category apply not to the evaluation sets but to the strings used in the evaluations. In certain cases it is necessary to manipulate the original string before evaluation takes place. All string operators other than concatenation modify the original string. The concatenation, change, add-to-head, append, drop-head, and drop-tail operators define basic string operations such as string difference, addition and division.

CONCATENATION Operator: “/”
Definition: A / B ⇒ (a1..anb1..bm) and B / A ⇒ (b1..bma1..an) iff A ⇒ (a1..an), B ⇒ (b1..bm) where n, m ≥ 1.
Example: If A ⇒ (aaaa), B ⇒ (bbbb) then A / B ⇒ (aaaabbbb), and B / A ⇒ (bbbbaaaa).

APPEND Operator: “+”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n, m ≥ 1. Then A + B ⇒ (a1..anb1..bm) and B + A ⇒ (b1..bma1..an).
Example: If A ⇒ (aaaa), B ⇒ (bbbb) then A + B ⇒ (aaaabbbb), and B + A ⇒ (bbbbaaaa)⁴.

ADD-TO-HEAD Operator: “>”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n, m ≥ 1. Then A > B ⇒ (a1..anb1..bm) and B > A ⇒ (b1..bma1..an).
Example: If A ⇒ (aaaa), B ⇒ (bbbb) then A > B ⇒ (aaaabbbb), and B > A ⇒ (bbbbaaaa).

CHANGE Operator: “~”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n, m ≥ 1. Then A ~ B ⇒ (b1..bm) and B ~ A ⇒ (a1..an).
Example: If A ⇒ (aaaa), B ⇒ (bbbb) then A ~ B ⇒ (bbbb), and B ~ A ⇒ (aaaa).

DROP-HEAD Operator: “$”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n ≥ m ≥ 1. Then A $ B ⇒ (am+1..an) if a1..am = b1..bm; A $ B ⇒ ∅ otherwise.
Example: If A ⇒ (aaab), B ⇒ (aaa), and C ⇒ (aaab) then A $ B ⇒ (b), and A $ C ⇒ (ε).

DROP-TAIL Operator: “^”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n ≥ m ≥ 1. Then A ^ B ⇒ (a1..an-m) if an-m+1..an = b1..bm; A ^ B ⇒ ∅ otherwise.
Example: If A ⇒ (baaa), B ⇒ (aaa), and C ⇒ (baaa) then A ^ B ⇒ (b), and A ^ C ⇒ (ε).
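A minimal Python sketch of four of these operators, lifted to sets of strings, may make the definitions concrete. It is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, APPEND and ADD-TO-HEAD are omitted because they differ from concatenation only in modifying the original input string, and dropping a string equal to the whole input is taken to leave the empty string.

```python
def concatenation(a, b):
    """'/': every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def change(a, b):
    """'~': a is the condition-operand; if it is satisfied, the result is b."""
    return set(b) if a else set()


def drop_head(a, b):
    """'$': remove a leading string of b from each string of a.

    Strings of a that do not start with any string of b contribute nothing,
    so a complete mismatch yields the empty set.
    """
    return {x[len(y):] for x in a for y in b if x.startswith(y)}


def drop_tail(a, b):
    """'^': remove a trailing string of b from each string of a."""
    return {x[:len(x) - len(y)] for x in a for y in b if x.endswith(y)}
```

For example, `drop_head({"aaab"}, {"aaa"})` leaves "b", and `drop_tail({"baaa"}, {"aaa"})` likewise leaves "b", as in the DROP-HEAD and DROP-TAIL examples.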

3.5 Miscellaneous Operators

The operators that belong to neither the set nor the string operators are the grouping, optional and ghost operators. They are implemented in our system as defined below:

GROUPING Operator: “{}”
Definition: Let ⊕ and ⊗ be operators where the precedence of ⊗ is higher than the precedence of ⊕, and let A ⊕ B ⇒ (a1..an) and B ⊗ C ⇒ (b1..bn). Then the expression {A ⊕ B} ⊗ C is evaluated as (a1..an) ⊗ C, whereas the expression A ⊕ B ⊗ C is evaluated as A ⊕ (b1..bn).

4 The difference between the ADD-type operators and CONCATENATION is that the ADD-type operators modify the original string by appending or adding a new string to it. This allows us to define the inverse of the string deletions that result from a number of morphological rules in various languages.

OPTIONAL Operator: “[]”
Definition: Let A ⇒ (a1..an), B ⇒ (b1..bm) where n, m ≥ 1, let Õ be an operator, and let A Õ B ⇒ (c1..cn). Then A Õ [B] ⇒ (a1..an, c1..cn).
Example: If A ⇒ (aaaa), and B ⇒ (bbbb) then A / [B] ⇒ (aaaa, aaaabbbb).
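The optional operator can be read as the union of two evaluations: one in which the bracketed operand is skipped and one in which it is applied. The Python sketch below illustrates that reading with concatenation as the enclosing operator; the function names are hypothetical and the code is not the paper's implementation.

```python
def concatenation(a, b):
    """'/': every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def optional(a, b, op):
    """A op [B]: the results of A alone together with the results of A op B."""
    return set(a) | op(a, b)
```

With A = {"aaaa"} and B = {"bbbb"}, `optional(A, B, concatenation)` yields both "aaaa" and "aaaabbbb", as in the example for A / [B].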

GHOST Operator: “<”
Definition: Let P1 and P2 be two parse trees, A ⇒ (P1), B ⇒ (P2) and A Õ B ⇒ (P1 P2). Then A Õ B< ⇒ (P1), and A< Õ B ⇒ (P2).

4. Morphologic Processes

One of the advantages of the present approach is that it can handle morphological processes. Those that can be implemented under OBF include concatenative morphology, infixation, circumfixation, reduplication, subsegmental morphology, zero morphology and subtractive morphology.

4.1 Concatenative Morphology

The simplest model of morphology that one can imagine is the situation where a morphologically complex word can be analyzed as a series of morphemes concatenated together (Sproat 1992: 44).

The concatenation operator ‘/’ corresponds to concatenative morphology: it divides a string into the morphemes stated by the morphological rule.

4.2 Infixation

Apart from attaching to the left or right edge of a word, affixes may also attach as infixes inside words. Infixing is not uncommon cross-linguistically, and some language groups, such as the languages spoken in the Philippines, makes heavy use of it (Sproat 1992: 45).

Infixation needs some care in implementation. Since the rules governing the operation depend on the language under consideration, the main problem is to find the exact rules of infixation. Therefore no single operator describing infixation is defined; instead, a number of operators modify the original string so that it matches the lexicon entries and the rule reduces to a simple concatenation. To illustrate, consider an example from the Philippine language Bontoc (Sproat 1992: 45).

6) fikas → f / um / ikas
   ‘strong’   ‘be strong’

In example (6), a verb is derived from an adjective by infixing the affix -um-. The rule is to add the affix just after the first consonant⁵. Such a rule can be implemented as in rule (vii):

vii) ADJ_TO_VERB={CONSONANT ^ UM} / ANY & ADJ.
     CONSONANT=f.
     UM=um.
     ADJ=fikas.
     STRING=ALPHABET ? ALPHABET / ANY.

5 Sproat gives an extensive discussion of the description of infixation (Sproat 1992: 46-50). We do not go deeper into the subject here; we prefer to show with simple examples how the infixation operation can be implemented.


The basic idea is to drop the string um from within the word when evaluation (parsing) takes place. The drop-tail operator ‘^’ drops the string after the consonant that matches the string defined by the variable UM. After the drop operation, the remaining string is matched against ADJ. The variable ANY matches any string over the alphabet of the language. Infixation is thus implemented with a drop-tail operator together with concatenation and and.
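The drop-then-match idea can be sketched as a small recogniser. The Python code below is an illustration of the strategy only, not a translation of the evaluator; the function name, the consonant inventory and the one-word lexicon are assumptions made for the example.

```python
# Hypothetical terminal rules for the Bontoc example.
ADJ = {"fikas"}
CONSONANTS = set("fks")  # assumed inventory, just enough for the example
UM = "um"


def adj_to_verb(word):
    """Return the adjective from which an infixed verb is derived, or None.

    The word must consist of a consonant, the infix, and a remainder; the
    infix is dropped and what is left must be a known adjective.
    """
    head, rest = word[:1], word[1:]
    if head in CONSONANTS and rest.startswith(UM):
        candidate = head + rest[len(UM):]
        if candidate in ADJ:
            return candidate
    return None
```

Under these assumptions `adj_to_verb("fumikas")` recovers "fikas", while the bare adjective "fikas" is rejected because no infix follows its first consonant.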

4.3 Circumfixation

A somewhat natural antithesis to infixes are circumfixes, affixes which attach discontinuously around a stem. Not surprisingly, when one finds such cases, they are usually composed of a suffix and a prefix, each of which may function independently as morphemes. The argument for analyzing the combination as a discontinuous morpheme is that the circumfix has a function that is not derivable from the behavior of the prefix and the suffix of which the circumfix is composed (Sproat 1992: 50).

Circumfixation can be considered the simultaneous application of prefixing and suffixing. If the language contains too many words to include them all in the lexicon, circumfixed words must be handled by rule. A solution within the OBS (Operator-Based System) is to employ the drop-head and drop-tail operators together, so that the string is transformed into its simplest form in the language. An example from German is given in (7):

7) machen → ge / mach / t
   ‘to do’    ‘done’

For regular German verbs, forming the past (participle) form requires circumfixing ge- and -t around the verb after dropping -en from the infinitive. The rule can be implemented using the OBS methodology as in (viii):

viii) PAST=STRING $ GE ^ T + EN & INFINITIVE.
      INFINITIVE=machen.
      GE=ge.
      T=t.

Rule (viii) states that, before INFINITIVE is referenced, the head string represented by the variable GE (a lexicon entry that evaluates to the string ge) and the tail string represented by T must be dropped from the string. After the drop operations, EN (the string en) is appended to the remaining string. The modified string must be an entry of INFINITIVE. A morphological rule has thus been described using the basic string operators plus the and operator.
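The sequence of operations in rule (viii), drop the head, drop the tail, append, then look up, can be sketched as follows. This Python fragment is an illustration of the rule's logic and not the paper's implementation; the function name is hypothetical and the lexicon contains only the verb from example (7).

```python
# Terminal rules from rule (viii); EN is not listed there and is assumed here.
INFINITIVE = {"machen"}
GE, T, EN = "ge", "t", "en"


def past_to_infinitive(word):
    """Return the infinitive behind a ge-...-t form, or None if none matches."""
    if len(word) < len(GE) + len(T):
        return None
    if not (word.startswith(GE) and word.endswith(T)):
        return None
    stem = word[len(GE):len(word) - len(T)]   # STRING $ GE ^ T
    candidate = stem + EN                     # ... + EN
    return candidate if candidate in INFINITIVE else None  # ... & INFINITIVE
```

With this lexicon `past_to_infinitive("gemacht")` returns "machen", whereas "macht" fails at the drop-head step.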

4.4 Reduplication

Reduplication comes in two flavors. One, total reduplication, is used, for example, to mark plurals in Indonesian:

(54) orang    orang+orang
     ‘man’    ‘men’

In this case, whatever phonological content is in the base word is copied. However, there are cases of total reduplication where things are not so simple (Sproat 1992: 57).

Likewise, reduplication is present in Turkish (O’Grady et al. 1992: 127), as in “çabuk çabuk” (/*quick quick*/ very quickly), “yavaş yavaş” (/*slowly slowly*/ very slowly), and “akşam akşam” (/*evening evening*/ in the evening). O’Grady et al. (1992: 127) categorise such examples as full reduplication and describe copying only a part of the word as partial reduplication.


Reduplication is easy to implement using OBS, since there are operators to modify the original string before any parsing takes place. With this approach, the duplicated string is dropped by either drop-head or drop-tail operator.

ix-a) REDUPLICATION=STRING $ ORANG & ORANG.
ix-b) REDUPLICATION=STRING ^ ORANG & ORANG.

Rule (ix-a) states that REDUPLICATION is satisfied if, after the string defined by ORANG is dropped from the head of the input string, the remaining string is also defined by ORANG. If ORANG defines the string “orang”, the only string accepted is “orangorang”, the reduplicated form. Rule (ix-b) states the same condition with ORANG dropped from the tail of the string instead. The symmetry of reduplication allows either operator to be used in this case.
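A minimal Python sketch of the two reduplication rules follows. It assumes that ORANG defines a single string and reads ‘$’ as drop-head and ‘^’ as drop-tail, as in Appendix A; the function names are hypothetical.

```python
def drop_head(s, head):
    """Remove head from the front of s; None when s does not start with it."""
    return s[len(head):] if s.startswith(head) else None


def drop_tail(s, tail):
    """Remove tail from the end of s; None when s does not end with it."""
    return s[:-len(tail)] if tail and s.endswith(tail) else None


def reduplicated_by_head(word, base):
    """Rule (ix-a): STRING $ ORANG & ORANG."""
    return drop_head(word, base) == base


def reduplicated_by_tail(word, base):
    """Rule (ix-b): STRING ^ ORANG & ORANG."""
    return drop_tail(word, base) == base


for candidate in ("orangorang", "orang", "orangcat"):
    print(candidate,
          reduplicated_by_head(candidate, "orang"),
          reduplicated_by_tail(candidate, "orang"))
```

Both functions accept only "orangorang", which reflects the symmetry noted above.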

4.5 Subsegmental Morphology

Morphemes can also consist of less than a segment. An example of this is one kind of plural formation in Irish:

(59) cat (/kat/)    cait (/katj/)
     ‘cat’          ‘cats’

In these cases the final consonant of the singular is palatalized to form the plural. (Sproat 1992: 61).

In this case the stem itself is changed, which changes the meaning. In the example given, the final consonant is palatalised to form the plural. Sproat (1992: 62-63) gives examples of similar subsegmental morphology from Icelandic and from Ngbaka, a language of Zaire. In Icelandic, the stem vowels of strong verbs are modified to form past tenses, and in Ngbaka verb forms are constructed with different tense-aspect contrasts.

The way to implement these rules is to delete or change the necessary letters within the word to reach the stem itself. For example, dropping the last vowel from “cait” yields “cat”; if including both forms in the lexicon is not preferred, only one form of the word is available for matching. A possible rule for such a case is given in (x).

x) PLURAL={STRING ^ I} / CONSONANT & NOUN.

Rule (x) states that the plural form of a noun is satisfied only if the remaining string matches a NOUN when the last -i- of the word is dropped.
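One possible reading of rule (x) can be sketched in Python as follows. The interpretation (remove the -i- that precedes the final consonant, then look the result up as a NOUN) is an assumption, as are the toy lexicon, the consonant set, and the function name.

```python
# Toy lexicon and consonant inventory used only for this illustration.
NOUNS = {"cat"}
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def singular_of_plural(word):
    """Drop the -i- before the final consonant and check the result as a NOUN."""
    if len(word) < 3:
        return None
    body, final = word[:-1], word[-1]
    if final not in CONSONANTS or not body.endswith("i"):
        return None
    candidate = body[:-1] + final        # "cai" + "t" -> "ca" + "t"
    return candidate if candidate in NOUNS else None


print(singular_of_plural("cait"))
```

Only the singular form needs to be stored in the lexicon under this reading; the plural is recognised by undoing the vowel change.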

4.6 Zero Morphology

In addition to the various ways in which morphology can add phonological material to a stem, a morphological operation may also have no phonological expression whatsoever. (Sproat 1992: 64).

Sproat defines zero morphology⁶ as the addition of a zero morpheme to a stem, allowing a word to belong to more than one category; for example, “book” acts both as a noun and as a verb. If the existence of this morphology is accepted, such a relation can be implemented by placing these words in separate groups in the lexicon and linking the groups within the syntactic rules, as in rule (xi).

6 O’Grady et al. define zero morphology as conversion or zero derivation (O’Grady et al. 1992: 138).


xi) NOUNS= ONLY_NOUNS ? NOUNS_AND_VERBS.
VERBS= ONLY_VERBS ? NOUNS_AND_VERBS.
ONLY_NOUNS = cat ? dog ? …
ONLY_VERBS = do ? go ? …
NOUNS_AND_VERBS= book ? run ? …
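Rule (xi) amounts to building each category as a union of word groups. A small Python sketch, using only the example words listed in the rule (the elided entries are left out) and a hypothetical categories function, shows how a shared group lets one word fall under two categories:

```python
# Word groups from rule (xi); only the entries spelled out there are used.
ONLY_NOUNS = {"cat", "dog"}
ONLY_VERBS = {"do", "go"}
NOUNS_AND_VERBS = {"book", "run"}

# The or operator '?' is modelled here as set union.
NOUNS = ONLY_NOUNS | NOUNS_AND_VERBS
VERBS = ONLY_VERBS | NOUNS_AND_VERBS


def categories(word):
    """Return every category whose word group contains the word."""
    groups = (("NOUN", NOUNS), ("VERB", VERBS))
    return [name for name, members in groups if word in members]


print(categories("book"))
print(categories("cat"))
```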

Another implementation would be to allow the null string within the formalism; however, the current definition does not allow this because of its theoretical implications.

4.7 Subtractive Morphology

Not surprisingly, perhaps, there are also a limited number of examples of subtractive morphology, where material is taken away to mark a morphological operation (Sproat 1992: 64).

Subtractive morphology⁷ drops a part of a stem to mark a morphological operation. Operators are defined that invert the drop operation by adding strings to words. However, a morphological rule may be complicated by other operations, in which case it may not always be possible to know the exact insertion rule. Such problems are language-dependent and can be solved by applying other operators.

The operators used to insert strings are add-head and append operators. One of them can be used according to a given situation.

5. Phonologic Processes

Some of the phonological processes implemented under OBF are deletion and insertion, and harmony rules.

5.1 Deletion and Insertions

Our formalism handles both deletion and insertion. Two insertion operations, add-head and append, and two deletion operations, drop-head and drop-tail, are defined. Which one should be used depends on the case.

xii) DAR_UNLUNUN_DUSMESI=SOZ / {SESSIZ_HARF + DAR_UNLU} / SESSIZ_HARF.
DROP_HIGH_VOWEL=STRING / {CONSONANT + HIGH_VOWEL} / CONSONANT.

Rule (xii) is taken from Turkish phonology and defines the dropping of a high vowel when it occurs between two consonants. Before the word is parsed, the dropped segment, which is a high vowel (ı, i, u, ü), must be added back to find the deep form.

8) “burun” + “u”    “burnu”
   (the nose + acc-suffix    to the nose-acc)

The suffixing in example 8 is done after dropping the last high vowel as in example 9.

9) drop_last_high_vowel(“burun”) + “u”
   “burn” + “u”    “burnu”

7 O’Grady et al. use the term clipping for subtractive morphology (O’Grady et al. 1992: 139).

During parsing, “burnu” must be modified by inserting a “u” to yield “burunu”, which is the concatenation of “burun” and “u”, both of which are defined in the lexicon (“burun” as a noun and “u” as the accusative suffix). To recognise the string correctly, the “u” must be inserted before parsing takes place, and in this case the append operation performs the insertion.
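The recovery of the deep form can be sketched in Python as a brute-force search. This is an illustration under assumptions, not the authors' algorithm: the function name, the toy lexicon, and the strategy of trying every high vowel between every pair of adjacent consonants are all hypothetical.

```python
HIGH_VOWELS = "ıiuü"
VOWELS = set("aeıioöuü")

# Toy lexicon entries (assumption): one noun and the accusative suffixes.
NOUNS = {"burun"}
ACC_SUFFIXES = {"ı", "i", "u", "ü"}


def restore_dropped_vowel(surface):
    """Insert a high vowel between adjacent consonants and split noun + suffix."""
    analyses = []
    for i in range(1, len(surface)):
        # A vowel can only have been dropped between two consonants.
        if surface[i - 1] in VOWELS or surface[i] in VOWELS:
            continue
        for vowel in HIGH_VOWELS:
            deep = surface[:i] + vowel + surface[i:]
            # Accept the deep form if it is a lexicon noun plus a suffix.
            for j in range(1, len(deep)):
                if deep[:j] in NOUNS and deep[j:] in ACC_SUFFIXES:
                    analyses.append((deep[:j], deep[j:]))
    return analyses


print(restore_dropped_vowel("burnu"))
```

For “burnu” the only consonant pair is r-n, and only the candidate “burunu” splits into a lexicon noun followed by a suffix.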

5.2 Long-distance effects: Harmony Rules

In order to show that OBF handles vowel harmony, we present examples from Turkish after a brief overview of Turkish vowel harmony.

5.2.1 Turkish Vowel Harmony

Turkish has 8 vowels: a,e,ı,i,o,ö,u, and ü. The first syllable of a word may contain any of the vowels. However, the following syllables will have only those vowels that are conditioned by the vowel harmony rules. Table 1 shows the relations between the vowels as suggested by the harmony rules (Underhill 1976: 25).

Table 1: Relation Between Turkish Vowels as Harmony Rules Suggest

Preceding Vowel    Following Vowel
e                  e,i
i                  e,i
ö                  e,ü
ü                  e,ü
a                  a,ı
ı                  a,ı
o                  a,u
u                  a,u

The major harmony requires a word whose last vowel is back (a,ı,o,u) to take a suffix whose first vowel is also back, and a word whose last vowel is front (e,i,ö,ü) to take a suffix whose first vowel is front. “okulu”, “okula” and “okulo” will all be accepted if the major harmony rule is applied on its own.

The minor harmony allows only the following constructions: if the last vowel of the word is unrounded (a,e,ı,i), the first vowel of the suffix must be unrounded. If the last vowel of the word is rounded (o,ö,u,ü), the first vowel of the suffix must be either low and unrounded (a,e) or high and rounded (u,ü).

Tables 2, 3, and 4 define the vowel categories of Turkish (Underhill 1976: 23-24; Gencan 1979: 44).

Table 2: High and Low Vowels of Turkish.

High    Low
i       e
ı       a
u       o
ü       ö


Table 3: Rounded and Unrounded Vowels of Turkish

Rounded    Unrounded
o          a
ö          e
u          ı
ü          i

Table 4: Front and Back Vowels of Turkish

Front    Back
i        ı
e        a
ö        o
ü        u

A. The following vowel assimilates to the preceding vowel in frontness; that is, front vowels must be followed by front vowels, and back vowels must be followed by back vowels.

B. A following high vowel assimilates to the preceding vowel in rounding; that is, high vowels are rounded after a rounded vowel, unrounded after an unrounded vowel.

C. A following low vowel must be unrounded; that is, o and ö may not appear in any syllable except the first in a Turkish word (Underhill 1976: 1).

When the minor harmony is applied alone, it does not cover all of the restrictions. The major harmony is likewise incomplete without the minor harmony. The rule holds only when both are applied at the same time, and this is what the and operator does. The harmony rule first checks whether the string satisfies the minor harmony. If it does, the same string is checked against the major harmony rule. If both are satisfied, the string is accepted according to the harmony rules; otherwise it is rejected.

10) “okulu”    “okul” + “u”
    (the school-acc.    “okul” + acc-suffix)

The string “okulu” given in example (10) is grammatical whereas the string “okuli” given in example (11) is not grammatically constructed according to the harmony rules.

11) “okuli”    “okul” + “i”
    (the school-acc*    “school” + acc-suffix)

These rules could be implemented as given in rule (xiii).

xiii) VOWEL_HARMONY = MINOR_HARMONY & MAJOR_HARMONY.
MINOR_HARMONY =
    {LAST_VOWEL_UNROUNDED / {SUFFIX & FIRST_VOWEL_UNROUNDED} ?
     LAST_VOWEL_ROUNDED / {SUFFIX & {FIRST_VOWEL_LOW_UNROUNDED ?
                                     FIRST_VOWEL_HIGH_ROUNDED}}}.
MAJOR_HARMONY =
    {LAST_VOWEL_BACK / {SUFFIX & FIRST_VOWEL_BACK} ?
     LAST_VOWEL_FRONT / {SUFFIX & FIRST_VOWEL_FRONT}}.
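The effect of rule (xiii) can be illustrated with a short Python sketch built from the vowel classes of Tables 2, 3 and 4. The function names are hypothetical, the stem and suffix are assumed to be already separated, and a stem or suffix without a vowel is assumed to satisfy both rules trivially.

```python
# Vowel classes from Tables 2, 3 and 4.
HIGH, LOW = set("ıiuü"), set("aeoö")
ROUNDED, UNROUNDED = set("oöuü"), set("aeıi")
FRONT, BACK = set("eiöü"), set("aıou")
VOWELS = FRONT | BACK


def last_vowel(s):
    return next((c for c in reversed(s) if c in VOWELS), None)


def first_vowel(s):
    return next((c for c in s if c in VOWELS), None)


def major_harmony(stem, suffix):
    """Back stems take back suffix vowels; front stems take front ones."""
    a, b = last_vowel(stem), first_vowel(suffix)
    if a is None or b is None:
        return True                      # assumption: nothing to check
    return (a in BACK) == (b in BACK)


def minor_harmony(stem, suffix):
    """Unrounded -> unrounded; rounded -> low unrounded or high rounded."""
    a, b = last_vowel(stem), first_vowel(suffix)
    if a is None or b is None:
        return True                      # assumption: nothing to check
    if a in UNROUNDED:
        return b in UNROUNDED
    return (b in LOW and b in UNROUNDED) or (b in HIGH and b in ROUNDED)


def vowel_harmony(stem, suffix):
    """VOWEL_HARMONY = MINOR_HARMONY & MAJOR_HARMONY."""
    return minor_harmony(stem, suffix) and major_harmony(stem, suffix)


for suffix in ("u", "a", "o", "i"):
    print("okul +", suffix, vowel_harmony("okul", suffix))
```

As in the discussion above, “okul” + “o” passes the major harmony alone but fails once the minor harmony is added, and “okul” + “i” is rejected.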

6. Implementation


OBF has been implemented in a PC environment using PCSCHEME, a dialect of Lisp (Altunel 1997). The implementation is based on a top-down, depth-first search algorithm with left-to-right evaluation, and returns all possible results at once. The system has a querying mechanism that accepts questions such as “is ‘loves’ a verb ?” in a Lisp-like format and returns the corresponding result(s) according to the knowledge base of the system. The knowledge base consists of the file of non-terminal rules, henceforth the Rule Definition File (RDF); the file of terminals, the Terminals Definition File (TDF); and a list of user-defined functions, the User Functions Definition File (UFDF). When started, the program loads the RDF⁸, TDF, and UFDF into memory. The evaluation⁹ is initialised after the user enters a query asking for information about a phrase of the language. Evaluation first searches the Temporary Memory (TM) to check whether the phrase has been evaluated previously. If it is not found, the UFDF, TDF, and RDF are searched in that order. When the phrase is found as an expression, the expression is evaluated recursively. The evaluation stops either when the expression cannot be evaluated properly or when it is fully evaluated, returning a list of parse trees that includes all possible parsings of the string.
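The lookup order described above (TM, then UFDF, TDF and RDF) can be sketched in Python. This is a simplified illustration and not the PCSCHEME code: the Evaluator class, its method names, the tuple-based prefix expressions, and the restriction to the or (‘?’) and concatenation (‘/’) operators are all assumptions made for the example, and left-recursive rules are not handled.

```python
class Evaluator:
    def __init__(self, user_functions, terminals, rules):
        self.tm = {}                          # Temporary Memory: cached results
        self.user_functions = user_functions  # UFDF: name -> predicate on strings
        self.terminals = terminals            # TDF: name -> set of strings
        self.rules = rules                    # RDF: name -> prefix expression

    def parses(self, name, s):
        """Return every parse tree of string s as the category name."""
        key = (name, s)
        if key in self.tm:                    # 1. previously evaluated?
            return self.tm[key]
        if name in self.user_functions:       # 2. user functions
            trees = [(name, s)] if self.user_functions[name](s) else []
        elif name in self.terminals:          # 3. terminals
            trees = [(name, s)] if s in self.terminals[name] else []
        elif name in self.rules:              # 4. non-terminal rules
            trees = [(name, t) for t in self._eval(self.rules[name], s)]
        else:
            trees = []
        self.tm[key] = trees
        return trees

    def _eval(self, expr, s):
        if isinstance(expr, str):             # reference to another variable
            return self.parses(expr, s)
        op, left, right = expr                # prefix form: (operator, a, b)
        if op == "?":                         # or: collect both alternatives
            return self._eval(left, s) + self._eval(right, s)
        if op == "/":                         # concatenation: try every split
            return [(l, r)
                    for i in range(1, len(s))
                    for l in self._eval(left, s[:i])
                    for r in self._eval(right, s[i:])]
        raise ValueError("unsupported operator: " + op)


ev = Evaluator(
    user_functions={},
    terminals={"VERB": {"loves"}, "NOUN": {"okul"}, "ACC": {"u"}},
    rules={"NOUN_ACC": ("/", "NOUN", "ACC")},
)
print(ev.parses("VERB", "loves"))
print(ev.parses("NOUN_ACC", "okulu"))
```

Returning a list of trees rather than a single answer mirrors the exhaustive evaluation property: an ambiguous string simply yields more than one entry.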

A very simple string generator has also been developed, because some operations need string generation when they are evaluated. For example, the operators change, add-head, append, drop-head, and drop-tail all need string generation. The first operand of these operators controls the string, and the second operand generates the strings used to manipulate the original string.

7. Conclusions

We have presented an operator-based formalism that is useful in natural language applications. The formalism comprises set and string operators, together with variables that define grammatical rules and terminal symbols. Each variable normally corresponds to a phonological, morphological, or syntactic item of a natural language. The system is algorithmic in the sense that each operator defines an operation to be executed on these variables, and each operation is usually quite easy to implement.

One advantage of the present approach is that it can be used as a language-independent tool. It is a complete (or nearly complete) approach to NLs in the sense that syntactic, morphological, and phonological rules can all be implemented without leaving the formalism.

Currently, work is under way to develop a language learning front-end algorithm. This algorithm will take a sufficiently large set of strings of a language and generate a corresponding grammar using the present formalism.

8 The expressions in the RDF are transformed from infix to prefix notation, as the evaluation algorithm requires.

9 Throughout the paper, evaluation refers to the implementation processes used to generate the necessary information or structures, such as evaluation lists or parse trees, from the rule base.


References

Altunel, Yusuf (1997). Design and Implementation of an Operator-Based Rule System for Natural Language Analysis. M.Sc. Thesis. Middle East Technical University, Department of Computer Engineering. Ankara.

Anderson, Stephen R. (1995). A-Morphous Morphology. Cambridge University Press.

Boguraev, Branimir (1988). “A Natural Language Toolkit”. U. Reyle and C. Rohrer (eds.). Natural Language Parsing and Linguistic Theories. D. Reidel Publishing Company, Dordrecht.

Byrd, Roy J., Calzolari, Nicoletta, Chodorow, Martin S., Klavans, Judith L., Neff, Mary S., Rizk, Omneya A. (1987). Tools and Methods for Computational Lexicology. Computational Linguistics, 13(3-4).

Gencan, Tahir Nejat (1979). Dilbilgisi. Türk Dil Kurumu Yayınları. Ankara.

Reyle, U. and Rohrer, C. (1988). Natural Language Parsing and Linguistic Theories. D. Reidel Publishing Company, Dordrecht.

Shieber, Stuart M. (1988). “Separating Linguistic Analyses from Linguistic Theories”. U. Reyle and C. Rohrer (eds.). Natural Language Parsing and Linguistic Theories. D. Reidel Publishing Company, Dordrecht.

Shieber, Stuart M. (1986). An Introduction to Unification-Based Approaches to Grammar. Center for the Study of Language and Information, Leland Stanford Junior University.

Sijtsma, Chris (1994). “Are Features Inherited?”. Carlos Martin Vide (ed.). Current Issues in Mathematical Linguistics. Elsevier.

Sproat, Richard (1992). Morphology and Computation. Cambridge, MA: MIT Press.

Underhill, Robert (1976). Turkish Grammar. Cambridge, MA: MIT Press.

Vide, Carlos Martin (1994). Current Issues in Mathematical Linguistics. Elsevier.

O’Grady, William and Dobrovolsky, Michael (1997). Contemporary Linguistics. St. Martin’s Press, New York.


Appendix A : The BNF Definition of Formalism

<RULE_BASE> ::= <RULES> <END_OF_RULES_SYMBOL>
<RULES> ::= <RULE> [<RULES>]
<RULE> ::= <DEFINER> <EQUATION_SYMBOL> <DEFINITION> <END_OF_STATEMENT_SYMBOL>
<DEFINER> ::= <NONTERMINAL_ITEM>
<DEFINITION> ::= <STATEMENT>
<STATEMENT> ::= <COMMENT_STATEMENT> | <SIMPLE_STATEMENT> |
    <REPETITION_STATEMENT> | <OPTIONAL_STATEMENT> |
    <OR_STATEMENT> | <AND_STATEMENT> | <DIFFERENCE_STATEMENT> |
    <CONCATENATION_STATEMENT> | <REPLACE_STATEMENT> | <CHANGE_STATEMENT> |
    <DROP_HEAD_STATEMENT> | <DROP_TAIL_STATEMENT> |
    <APPEND_STATEMENT> | <NOT_STATEMENT> | <GHOST_STATEMENT> | <ESCAPE_STATEMENT>
<COMMENT_STATEMENT> ::= <COMMENT_SYMBOL> <COMMENT_BODY> <COMMENT_SYMBOL>
<COMMENT_BODY> ::= <STRING>
<SIMPLE_STATEMENT> ::= <NONTERMINAL_ITEM>
<REPETITION_STATEMENT> ::= <REPETITION_LEFT_BRACKET> <STATEMENT> <REPETITION_RIGHT_BRACKET>
<OPTIONAL_STATEMENT> ::= <OPTIONAL_LEFT_BRACKET> <STATEMENT> <OPTIONAL_RIGHT_BRACKET>
<OR_STATEMENT> ::= <STATEMENT> <OR_SYMBOL> <STATEMENT>
<AND_STATEMENT> ::= <STATEMENT> <AND_SYMBOL> <STATEMENT>
<DIFFERENCE_STATEMENT> ::= <STATEMENT> <DIFFERENCE_SYMBOL> <STATEMENT>
<CONCATENATION_STATEMENT> ::= <STATEMENT> <CONCATENATION_SYMBOL> <STATEMENT>
<REPLACE_STATEMENT> ::= <STATEMENT> <REPLACE_SYMBOL> <STATEMENT>
<CHANGE_STATEMENT> ::= <STATEMENT> <CHANGE_SYMBOL> <STATEMENT>
<DROP_HEAD_STATEMENT> ::= <STATEMENT> <DROP_HEAD_SYMBOL> <STATEMENT>
<DROP_TAIL_STATEMENT> ::= <STATEMENT> <DROP_TAIL_SYMBOL> <STATEMENT>
<APPEND_STATEMENT> ::= <STATEMENT> <APPEND_SYMBOL> <STATEMENT>
<NOT_STATEMENT> ::= <STATEMENT> <NOT_SYMBOL>
<GHOST_STATEMENT> ::= <STATEMENT> <GHOST_SYMBOL>
<ESCAPE_STATEMENT> ::= <CHARACTER> <ESCAPE_SYMBOL>
<NONTERMINAL_ITEM> ::= <NONTERMINAL_CHARACTER> [<NONTERMINAL_ITEM>]
<TERMINAL_ITEM> ::= <TERMINAL_CHARACTER> [<TERMINAL_ITEM>]
<STRING> ::= <CHARACTER> [<STRING>]
<EQUATION_SYMBOL> ::= =
<END_OF_STATEMENT_SYMBOL> ::= .
<COMMENT_SYMBOL> ::= %
<REPETITION_LEFT_BRACKET> ::= {
<REPETITION_RIGHT_BRACKET> ::= }
<OPTIONAL_LEFT_BRACKET> ::= [
<OPTIONAL_RIGHT_BRACKET> ::= ]
<OR_SYMBOL> ::= ?
<AND_SYMBOL> ::= &
<DIFFERENCE_SYMBOL> ::= -
<CONCATENATION_SYMBOL> ::= / | <WHITE_SPACE>
<REPLACE_SYMBOL> ::= @
<CHANGE_SYMBOL> ::= ~
<DROP_HEAD_SYMBOL> ::= $
<DROP_TAIL_SYMBOL> ::= ^
<APPEND_SYMBOL> ::= +
<NOT_SYMBOL> ::= :
<GHOST_SYMBOL> ::= <
<ESCAPE_SYMBOL> ::= !
<END_OF_RULES_SYMBOL> ::= EOF
<CHARACTER> ::= a | b | c | ç | d | e | f | g | ğ | h | i | ı | j | k | l | m | n | o | ö | p | r | s | ş | t | u | ü | v | y | z | A | B | C | Ç | D | E | F | G | Ğ | H | I | İ | J | K | L | M | N | O | Ö | P | R | S | Ş | T | U | Ü | V | Y | Z | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | = | + | \ | & | ? | | | . | > | , | < | ‘ | “
<NONTERMINAL_CHARACTER> ::= A | B | C | Ç | D | E | F | G | Ğ | H | I | İ | J | K | L | M | N | O | Ö | P | R | S | Ş | T | U | Ü | V | Y | Z | _
<TERMINAL_CHARACTER> ::= a | b | c | ç | d | e | f | g | ğ | h | i | ı | j | k | l | m | n | o | ö | p | r | s | ş | t | u | ü | v | y | z | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
<WHITE_SPACE> ::= NEWLINE | SPACE | PAGE | RETURN | RUBOUT | TAB
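As an illustration of how the symbol definitions above could drive a lexical pass, the following Python sketch splits a rule into tokens. It is not part of the original implementation: the tokenize function and the token names are hypothetical, comments and the escape mechanism are not handled, white space is simply skipped rather than treated as concatenation, and the nonterminal pattern accepts all of A-Z for brevity.

```python
import re

# Operator symbols as listed in the grammar above; names are illustrative.
OPERATORS = {
    "=": "EQUATION", ".": "END_OF_STATEMENT", "?": "OR", "&": "AND",
    "-": "DIFFERENCE", "/": "CONCATENATION", "@": "REPLACE", "~": "CHANGE",
    "$": "DROP_HEAD", "^": "DROP_TAIL", "+": "APPEND", ":": "NOT",
    "<": "GHOST", "!": "ESCAPE",
    "{": "REPETITION_LEFT", "}": "REPETITION_RIGHT",
    "[": "OPTIONAL_LEFT", "]": "OPTIONAL_RIGHT",
}

# Upper-case runs are nonterminals, lower-case or digit runs are terminals,
# and any other non-blank character is looked up as an operator symbol.
TOKEN = re.compile(r"\s*(?:([A-ZÇĞİÖŞÜ_]+)|([a-zçğıöşü0-9]+)|(\S))")


def tokenize(rule):
    """Split one rule into (kind, text) pairs."""
    tokens = []
    for nonterminal, terminal, symbol in TOKEN.findall(rule):
        if nonterminal:
            tokens.append(("NONTERMINAL", nonterminal))
        elif terminal:
            tokens.append(("TERMINAL", terminal))
        elif symbol in OPERATORS:
            tokens.append((OPERATORS[symbol], symbol))
        else:
            raise ValueError("unknown symbol: " + symbol)
    return tokens


for token in tokenize("PAST= STRING $ GE ^ T + EN & INFINITIVE."):
    print(token)
```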
