
By Tim

Adrian Gareau

Edward Dantsiguer

Agenda
1.0 Definitions
1.1 Characteristics of Successful Machines
1.2 Practical Applications

– 1.2.1 machine translation

– 1.2.2 database access

– 1.2.3 text interpretation
• 1.2.3.1 information retrieval

• 1.2.3.2 text categorization

• 1.2.3.3 extracting data from text

2.0 Efficient Parsing
3.0 Scaling up the Lexicon
4.0 List of References

Current Topic: 1.0 Definitions

1.0 Definitions

Natural languages are languages that living creatures use for communication

Artificial Languages are mathematically defined classes of signals that can be used for communication with machines

A language is a set of sentences that may be used as signals to convey semantic information

The meaning of a sentence is the semantic information it conveys

Current Topic: 1.1 Characteristics of Successful Machines

1.1 Characteristics of Successful Natural Language Systems

Successful systems share two properties:

– they are focused on a particular domain rather than allowing discussion of any topic

– they are focused on a particular task rather than attempting to understand language completely

This means that any natural language machine is more likely to work correctly if one restricts the set of possible inputs -- the size of the space of possible inputs is inversely proportional to the likelihood of success

Current Topic: 1.2 Practical Applications

1.2 Practical Applications

We are going to look at five practical applications of natural language processing:
– machine translation (1.2.1)

– database access (1.2.2)

– text interpretation (1.2.3)
• information retrieval (1.2.3.1)

• text categorization (1.2.3.2)

• extracting data from text (1.2.3.3)

1.2.1 Machine Translation

First suggestions were made by the Russian Smirnov-Troyansky and the Frenchman C.G. Artsouni in the 1930s

First serious discussions were begun in 1946 by mathematician Warren Weaver
– there was great hope that computers would be able to translate from one natural language to another (inspired by the success of the Allied efforts using the British Colossus computer)
• Turing’s project “translated” coded messages into intelligible German

By 1954 there was a machine translation (MT) project at Georgetown University
– it succeeded in correctly translating several sentences from Russian into English

After the Georgetown project, MT projects were started up at MIT, Harvard and the University of Pennsylvania

1.2.1 Machine Translation (Cont)

It soon (1966) became apparent that translation is a very complicated task and that it would be practically impossible to account for all the intricacies and nuances of natural languages
– correct translation would require an in-depth understanding of both natural languages, since the structure of expressions varies in every natural language

– Yehoshua Bar-Hillel declared that MT was impossible (Bar-Hillel Paradox):

• analysis by humans of messages relies to some extent on the information which is not present in the words that make up the message

– “The pen is in the box”

» [i.e. the writing instrument is in the container]

– “The box is in the pen”

» [i.e. the container is in the playpen or the pigpen]

1.2.1 Machine Translation (Cont)

There have been no fundamental breakthroughs in machine translation in the last 34 years

Progress has been made on restricted domains
– there are dozens of systems that are able to take a subset of one language and, fairly accurately, translate it into another language
– these systems operate well enough to save significant sums of money over fully manual techniques (see examples two pages down)

Of these systems, the ones operating on a more restricted set produce the more impressive results

Machine translation is NOT automatic speech recognition

1.2.1 Machine Translation (Cont)

Examples of poor machine translations would include:
– "the spirit is strong, but the body is weak" was translated literally as "the vodka is strong but the meat is rotten”

– "Out of sight, out of mind” was translated as "Invisible, insane”

– "hydraulic ram” was translated as "male water sheep”

These do not imply that machine translation is a waste of time
– some mistakes are inevitable regardless of the quality and sophistication of the system

– one has to realize that human translators also make mistakes

1.2.1 Machine Translation (Cont)

Examples of machine translation systems include:
– TAUM-METEO system
• translates weather reports from English to French

• works very well since language in government weather reports is highly stylized and regular

– SPANAM system
• translates Spanish into English

• worked on a more open domain

• results were reasonably good although resulting English text was not always grammatical and very rarely fluent

– AVENTINUS system
• advanced information system for multilingual drug enforcement

• allows law enforcement officials to know what the foreign document is about

• sorts, classifies and analyzes drug related information

1.2.1 Machine Translation (Cont)

There are three basic types of machine translation:
– Machine-assisted (aided) human translation (MAHT)
• the translation is performed by a human translator, but he/she uses a computer as a tool to improve or speed up the translation process
– Human-assisted (aided) machine translation (HAMT)
• the source language text is modified by a human translator either before, during or after it is translated by the computer
– Fully automatic machine translation (FAMT)
• the source language text is fed into the computer as a file, and the computer produces a translation automatically without any human intervention

1.2.1 Machine Translation (Cont)

Standing on its own, unrestricted machine translation (FAMT) is still inadequate
– Human-assisted machine translation (HAMT) could be used to improve the quality of translation
• one possibility is to have a human reader go over the text after the translation, correcting grammar errors (post-processing)
– a human reader can save a lot of time since some of the text will be translated correctly

– sometimes a monolingual human can edit the output without reading the original

• another possibility is to have a human reader edit the document before translation (pre-processing)

– make the original conform to a restricted subset of a language

– this will usually allow the system to translate the resulting text without any requirement for post-editing

1.2.1 Machine Translation (Cont)

Restricted languages are sometimes called “Caterpillar English”
– Caterpillar was the first company to try writing their manuals using pre-processing
– Xerox was the first company to really successfully use the pre-processing approach (SYSTRAN system)

• language defined for their manuals was highly restricted, thus translation into other languages worked quite well

There is a substantial start-up cost to any machine translation effort
– to achieve broad coverage, translation systems should have lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism)

1.2.1 Machine Translation (Cont)

There are several basic theoretical approaches to machine translation:
– Direct MT Strategy

• based on good glossaries and morphological analysis

• always between a pair of languages

– Transfer MT Strategy
• first, the source language is parsed into an abstract internal representation

• a ‘transfer’ is then made into the corresponding structures in the target language

– Interlingua MT Strategy
• the idea is to create an artificial language

– it shares all the features and makes all the distinctions of all languages

– Knowledge-Based Strategy
• similar to the above

• intermediate form is of semantic nature rather than a syntactic one

1.2.2 Database Access

Database access was the first major success of natural language processing

There was a hope that databases could be controlled by natural languages instead of complicated data retrieval commands
– this was a major problem in the early 1970s since the staff in charge of data retrieval could not keep up with users' demand for data

The LUNAR system was the first such interface
– built by William Woods in 1973 for the NASA Manned Spacecraft Center
– the system was able to correctly answer 78% of questions such as:

“What is the average modal plagioclase concentration for lunar samples that contain rubidium?”

1.2.2 Database Access (Cont)

Other examples of data retrieval systems would include:
– CHAT system

• developed by Fernando Pereira in 1983

• similar level of complexity to LUNAR system

• worked on geographical databases

• was restricted – question wording was very important

– TEAM system
• could handle a wider set of problems than CHAT

• was still restricted and unable to handle all types of input

1.2.2 Database Access (Cont)

Companies such as Natural Language Inc. and Symantec are still selling database tools that use natural language

The ability to have natural language control of databases is not as big a concern as it was in the 1970s
– graphical user interfaces and the integration of spreadsheets, word processors, graphing utilities, report generating utilities, etc. are of greater concern to database buyers today
– mathematical or set notation seems to be a more natural way of communicating with a database than plain English
– with the advent of SQL, the problem of data retrieval is not as major as it was in the past

1.2.3 Text Interpretation

In the early 1980s, most online information was stored in databases and spreadsheets

Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
– there is a need to sort this information to reduce it to some comprehensible amount

Text interpretation has become a major field in natural language processing
– becoming more and more important with the expansion of the Internet
– consists of:
• information retrieval

• text categorization

• data extraction

1.2.3.1 Information Retrieval

Information retrieval (IR) is also known as information extraction (IE)

Information retrieval systems analyze unrestricted text in order to extract specific types of information

IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
– relevance is determined by pre-defined domain guidelines, which must specify, as accurately as possible, exactly what types of information the system is expected to find
• a query would be a good example of such a pre-defined domain
– documents that contain relevant information are retrieved while others are ignored

1.2.3.1 Information Retrieval (Cont)

Sometimes documents could be represented by a surrogate, such as the title and a list of key words and/or an abstract

It is more common to use the full text, possibly subdivided into sections that each serve as a separate document for retrieval purposes

The query is normally a list of words typed by the user
– Boolean combinations of words were used by earlier systems to construct queries
• users found it difficult to get good results from Boolean queries
• it was hard to find a combination of “AND”s and “OR”s that would produce appropriate results

The Boolean model has been replaced by the vector-space model in modern IR systems
– in the vector-space model every list of words (both the documents and the query) is treated as a vector in an n-dimensional vector space (where n is the number of distinct tokens in the document collection)
– one can use a “1” in a vector position if that word appears and a “0” if it does not
– vectors are then compared to determine which ones are close
– the vector model is more flexible than the Boolean model
• documents can be ranked and the closest matches can be reported first

1.2.3.1 Information Retrieval (Cont)

There are many variations on the vector-space model
– some allow stating that two words must appear near each other
– some use a thesaurus to automatically augment the words in the query with their synonyms

A good discriminator must be chosen in order for the system to be effective
– common words like “a” and “the” don’t tell us much since they occur in just about every document
– a good way to set up the retrieval is to give a term a larger weight if it appears in a small number of documents (see the sketch below)
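A small sketch of this weighting idea in Python (the toy corpus and the function name are assumptions for illustration): a term that occurs in few documents gets a larger, inverse-document-frequency style weight, while a word like “the” that occurs in every document scores zero.

import math

documents = [                                  # assumed toy corpus
    "the pen is in the box",
    "the box is in the pen",
    "crude oil is the main export",
]

def weight(term, documents):
    """Inverse-document-frequency style weight: rarer terms weigh more."""
    df = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / df) if df else 0.0

print(weight("the", documents))    # 0.0  -- occurs in every document
print(weight("crude", documents))  # ~1.1 -- occurs in only one document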

1.2.3.1 Information Retrieval (Cont)

Another way to think about IR is in terms of databases. An IR system attempts to convert unstructured text documents into codified database entries. Database entries might be drawn from a set of fixed values, or they can be actual sub-strings pulled from the original source text.

From a language processing perspective, IR systems must operate at many levels, from word recognition to sentence analysis, and from understanding at the sentence level on up to discourse analysis at the level of full text document.

Dictionary coverage is an especially challenging problem since open-ended documents can be filled with all manner of jargon, abbreviations, and proper names, not to mention typos and telegraphic writing styles.

1.2.3.1 Information Retrieval (Cont)

Example (Vector-Space Model): assume that we have one very short document that contains one sentence: “CPSC 533 is the best Computer Science course at UofC”; also assume that our query is: “UofC”
– we need to set up our n-dimensional vector space: we have 10 distinct tokens (one for every word in the sentence)

– we are going to set up the following vector to represent the sentence: (1,1,1,1,1,1,1,1,1,1) -- indicating that all ten words are present

– we are going to set the following vector for the query: (0,0,0,0,0,0,0,0,0,1) -- indicating that “UofC” is the only word present in the query

– by ANDing the two vectors together, we get (0,0,0,0,0,0,0,0,0,1) meaning that our document contains “UofC”, as expected
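A minimal sketch of this binary vector-space example in Python; the helper names are assumptions, not part of the original slides.

def vectorize(text, vocabulary):
    """Binary vector: 1 if a vocabulary word occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

document = "CPSC 533 is the best Computer Science course at UofC"
query = "UofC"

vocabulary = sorted(set(document.lower().split()))  # the 10 distinct tokens
doc_vec = vectorize(document, vocabulary)           # all ones
query_vec = vectorize(query, vocabulary)            # a single 1, for "uofc"

# "AND" the vectors (elementwise product) and sum the overlap: a nonzero
# score means the document contains query terms, so it is retrieved.
score = sum(d * q for d, q in zip(doc_vec, query_vec))
print(score)  # 1

A real IR system would compare weighted vectors with a similarity measure such as the cosine of the angle between them, rather than a plain AND, so that documents can be ranked.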


1.2.3.1 Information Retrieval (Cont)

Example: Commercial System (HIGHLIGHT)
– helps users find relevant information in large volumes of text and present it in a structured fashion
• it can extract information from newswire reports for a specific topic area - such as global banking, or the oil industry - as well as current and historical financial and other data
– although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most trained professional would have time to look for
– see the demo at: http://www-cgi.cam.sri.com/highlight/
– could be classified under “Extracting Data From Text (1.2.3.3)”

1.2.3.2 Text Categorization

It is often desirable to sort all text into several categories

There are a number of companies that provide their subscribers access to all news on a particular industry, company or geographic area
– traditionally, human experts were used to assign the categories

– in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)

The context in which text appears is very important since the same word could be categorized completely differently depending on the context
– Example: in a dictionary, the primary definition of the word “crude” is vulgar, but in a large sample of the Wall Street Journal, “crude” refers to oil 100% of the time

1.2.3.3 Extracting Data From Text

The task of data extraction is to take on-line text and derive from it some assertions that can be put into a structured database

Examples of data extraction systems include:
– SCISOR system

• able to take stock information text (such as the type released by Dow Jones News Service) and extract important stock information pertaining to:

– events that took place

– companies involved

– starting share prices

– quantity of shares that changed hands

– effect on stock prices

Current Topic: 2.0 Efficient Parsing

2.0 Efficient Parsing

Parsing -- the act of analyzing the grammaticality of an utterance according to some specific grammar
– the previous sentence was “parsed” according to some grammar of “English” and was determined to be grammatical

– we read the words in some order (from left to right; from right to left; or in random order) and analyzed them one-by-one

Each parse is a different method of analyzing some target sentence according to some specified grammar

2.0 Efficient Parsing (Cont)

Simple left-to-right parsing is often insufficient
– it is hard to determine the nature of the sentence
• this means that we have to make an initial guess as to what the sentence is saying

• this forces us to backtrack if the guess is incorrect

Some backtracking is inevitable
– to make parsing efficient, we want to minimize the amount of backtracking
• even if a wrong guess is made, we know that a portion of the sentence has already been analyzed -- there is no need to start from scratch since we can use the information that is available to us

2.0 Efficient Parsing (Cont)

Example: we have two sentences:
– “Have students in section 2 of Computer Science 203 take the exam.”

– “Have students in section 2 of Computer Science 203 taken the exam?”

• the first nine words, “Have students in section 2 of Computer Science 203”, are exactly the same although the meanings of the two sentences are completely different

• if an incorrect guess is made, we can still use those first nine words when we backtrack

– this will require a lot less work

2.0 Efficient Parsing (Cont)

There are three main things that we can do to improve efficiency:
– don’t do twice what you can do once

– don’t do once what you can avoid altogether

– don’t represent distinctions that you don’t need

To accomplish these we can use a data structure known as a chart (matrix) to store partial results
– this is a form of dynamic programming
– results are only calculated if they cannot be found in the chart
– only the portion of the calculations that cannot be found in the chart is done, while the rest is retrieved from the chart

– algorithms that do this are called chart parsers

2.0 Efficient Parsing (Cont)

Examples of parsing techniques:
– Top-Down, Depth-First

– Top-Down, Breadth-First

– Bottom-Up, Depth-First Chart

– Prolog

– Feature Augmented Phrase Structure

These are not the only parsing techniques that exist

One is free to come up with his or her own algorithm for the order in which the individual words in every sentence will be analyzed

2.0 Efficient Parsing (Cont)

i) Top-Down, Depth-First
– uses a strategy of searching for phrasal constituents from the highest node (the sentence node) to the terminal nodes (the individual lexical items) to find a match to the possible syntactic structure of the input sentence

– stores attempts on a possibilities list as a stacked data structure (LIFO)

ii) Top-Down, Breadth-First
– same searching strategy as Top-Down, Depth-First

– stores attempts on a possibilities list as a queued data structure (FIFO)

2.0 Efficient Parsing (Cont)

iii) Bottom-Up, Depth-First Chart
– the parse begins at the word level and uses the grammar rules to build higher-level structures (“bottom-up”), which are combined until a goal state is reached or until all the applicable grammar rules have been exhausted

iv) Prolog
– relies on the functionality of the Prolog programming language to generate a parse using a Top-Down, Depth-First algorithm

– naturally deals with constituents and their relationships

v) Feature Augmented Phrase Structure
– takes a sentence as input and parses it by accessing information in a featured phrase-structure grammar and lexicon

– parser output is a tree

2.0 Efficient Parsing (Cont)

Chart parsing can be represented pictorially using a combination of n + 1 vertices and a number of edges

Notation for edge labels: [<Starting Vertex>, <Ending Vertex>, <Result> → <Part 1> ... <Part n> • <Needed Part 1> ... <Needed Part k>]
– if the Needed Parts are added to the already available Parts, then Result would be the outcome, spanning from Starting Vertex to Ending Vertex

– see examples (two pages down)

If there are no Needed Parts (i.e. k = 0), then the edge is called complete
– the edge is called incomplete otherwise

2.0 Efficient Parsing (Cont)

Chart-parsing algorithms use a combination of top-down and bottom-up processing
– this means that the parser never has to consider certain constituents that could not lead to a complete parse

– this also means that it can handle grammars with both left-recursive rules and rules with empty right-hand sides without going into an infinite loop

– result of our algorithm is a packed forest of parse tree constituents rather than an enumeration of all possible trees

Chart parsing consists of forming a chart with n + 1 vertices and adding edges to the chart one at a time, trying to produce a complete edge that spans from vertex 0 to n and is of category S (sentence), e.g. [0, n, S → NP VP •]

There is no backtracking -- everything that is put into the chart stays there

2.0 Efficient Parsing (Cont)

A) Edge [0, 5, S → NP VP •] -- says that an NP followed by a VP combine to make an S that spans the string from 0 to 5

B) Edge [0, 2, S → NP • VP] -- says that an NP spans the string from 0 to 2, and if we could find a VP to follow it, then we would have an S

2.0 Efficient Parsing (Cont)

There are four ways to add an edge to the chart:
– Initializer
• adds an edge to indicate that we are looking for the start symbol of the grammar, S, starting at position 0, but have not found anything yet
– Predictor
• takes an incomplete edge that is looking for an X and adds new incomplete edges that, if completed, would build an X in the right place
– Completer
• takes an incomplete edge that is looking for an X and ends at vertex j, and a complete edge that begins at j and has X as the left-hand side, and combines them to make a new edge where the X has been found
– Scanner
• similar to the completer, except that it uses the input words rather than existing complete edges to generate the X

2.0 Efficient Parsing (Cont)

Nondeterministic Chart Parsing Algorithm (figure)

2.0 Efficient Parsing (Cont)

Nondeterministic Chart Parsing Algorithm:
– treats the chart as a set of edges
– a new edge is non-deterministically added to the chart at every step (an edge is non-deterministically chosen from the possible additions)
– S is the start symbol and S’ is a new nonterminal symbol
• we start out looking for S (i.e. we currently have an empty string)

– add edges using one of the three methods (predictor, completer, scanner), one at a time until no new edges can be added

– at the end, if the required parse exists, it is found

– if none of the methods could be used to add another edge to the set, the algorithm terminates
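A minimal, agenda-driven sketch of these operations (initializer, predictor, scanner, completer) in Python; the toy grammar, lexicon and function names are assumptions chosen to mirror the “I feel it” example on the following slides, not code from the original source. Each edge is a tuple (start, end, LHS, found, needed), i.e. the notation [start, end, LHS → found • needed].

GRAMMAR = {                       # assumed toy grammar
    "S":  [["NP", "VP"]],
    "NP": [["Pronoun"]],
    "VP": [["Verb"], ["VP", "NP"]],
}
LEXICON = {                       # assumed toy lexicon: word -> categories
    "i": ("Pronoun",), "it": ("Pronoun",), "feel": ("Verb",),
}

def chart_parse(words, grammar, lexicon, start="S"):
    """Return True if the words can be parsed as the start symbol."""
    n = len(words)
    chart, agenda = set(), []

    def add(edge):                              # edges are never removed
        if edge not in chart:
            chart.add(edge)
            agenda.append(edge)

    # INITIALIZER: [0, 0, S' -> . S] -- looking for S, nothing found yet.
    add((0, 0, "S'", (), (start,)))

    while agenda:
        s1, e1, lhs, found, needed = agenda.pop()
        if needed:                              # incomplete edge, wants X
            x = needed[0]
            # PREDICTOR: edges that, if completed, build an X starting at e1.
            for rhs in grammar.get(x, []):
                add((e1, e1, x, (), tuple(rhs)))
            # SCANNER: consume the next input word if the lexicon says it is an X.
            if e1 < n and x in lexicon.get(words[e1], ()):
                add((s1, e1 + 1, lhs, found + (x,), needed[1:]))
            # COMPLETER: combine with complete X edges already in the chart.
            for s2, e2, lhs2, _f2, n2 in list(chart):
                if not n2 and lhs2 == x and s2 == e1:
                    add((s1, e2, lhs, found + (x,), needed[1:]))
        else:                                   # complete edge for lhs
            # COMPLETER (other direction): extend edges that were waiting for lhs.
            for s2, e2, lhs2, f2, n2 in list(chart):
                if n2 and n2[0] == lhs and e2 == s1:
                    add((s2, e1, lhs2, f2 + (lhs,), n2[1:]))

    return any(s == 0 and e == n and l == start and not nd
               for s, e, l, _f, nd in chart)

print(chart_parse("i feel it".split(), GRAMMAR, LEXICON))   # True
print(chart_parse("feel i".split(), GRAMMAR, LEXICON))      # False

Because the chart is a set and edges are never removed, there is no backtracking, matching the description above.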

2.0 Efficient Parsing (Cont)

Chart for a Parse of: “I feel it”

2.0 Efficient Parsing (Cont)

Using the sample chart on the previous page, the following steps are taken to complete the parse of “I feel it” -- page 1/3:
– 1. INITIALIZER: parsing from vertex 0 to vertex 0 and looking for S’, we still need to find S -- (a)

– 2. PREDICTOR: we are looking for an incomplete edge, that if completed, would give us S -- we know that S consists of NP and VP, meaning that by going from 0 to 0 we will have S if we find VP and NP -- (b)

– 3. PREDICTOR: following a very similar rule, we know that we will have NP if we can find a Pronoun ; this condition can be achieved by going from 0 to 0, looking for a Pronoun -- (c)

– 4. SCANNER: if we go from 0 to 1, parsing “I” we will have our NP since a Pronoun is found -- (d)

2.0 Efficient Parsing (Cont)

Example (continued) -- page 2/3:
– 5. COMPLETER: summarizing the above steps, we are looking for S and by going from 0 to 1 we have NP and are still looking for VP -- (e)

– 6. PREDICTOR: we are now looking for VP and by going from 1 to 1 we will have VP if we can find a Verb -- (f)

– 7. PREDICTOR: VP can consist of another VP and NP, meaning that 6 would also work if we can find VP and NP -- (g)

– 8. SCANNER: by going from 1 to 2 we can find a Verb, thus we can find VP -- (h)

– 9. COMPLETER: using 7 and 8, we know that since VP is found we can complete VP by going from 1 to 2 and finding NP -- (i)

– 10. PREDICTOR: NP can be completed by going from 2 to 2 and finding a Pronoun -- (j)

2.0 Efficient Parsing (Cont)

Example (continued) -- page 3/3:
– 11. SCANNER: we can find a Pronoun if we go from 2 to 3, thus completing NP -- (k)

– 12. COMPLETER: using 7 - 11, we know that VP can be found by going from 1 to 3, thus finding NP and VP -- (l)

– 13. COMPLETER: using all of the information we collected up to this point, one can get S by going from 0 to 3, thus finding the original NP and VP, where VP consists of another VP and NP -- (m)

All of these steps are summarized on the diagram on the next page

2.0 Efficient Parsing (Cont)

Trace of a Parse of “I feel it” (figure)

2.0 Efficient Parsing (Cont)

Left-Corner Parsing Algorithm (figure)

2.0 Efficient Parsing (Cont)

Left-Corner Parsing:
– avoids building some edges that could not possibly be part of an S spanning the whole string
– builds up a parse tree that starts with the grammar’s start symbol and extends down to the last word in the sentence
– the Nondeterministic Chart Parsing Algorithm is an example of a left-corner parser

– using the example on the previous slide:
• “ride the horse” would never be considered as a VP
– saves time since unrealistic combinations do not have to be first worked out and then discarded

2.0 Efficient Parsing (Cont)

Extracting Parses From the Chart: Packing

– when the chart parsing algorithm finishes, it returns an entire chart (collection of parse trees)

– what we really want is a parse tree (or several parse trees)

– Ex: • a) pick out parse trees that span the entire input

• b) pick out parse trees that for some reason do not span the entire input

– the easiest way to do this is to modify COMPLETER so that when it combines two child edges to produce a parent edge, it stores in the parent edge the list of children that comprise it.

– when we are done with the parse, we only need to look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree
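A small sketch of that reconstruction step, assuming complete edges are stored as (start, end, LHS, children) tuples whose children lists were filled in by the modified COMPLETER; the hand-built edges below stand in for the contents of a finished chart for “I feel it”.

def build_tree(edge):
    """Recursively expand an edge's children list into a parse tree."""
    _start, _end, lhs, children = edge
    return (lhs, [c if isinstance(c, str) else build_tree(c) for c in children])

# Hand-built complete edges standing in for the chart contents:
np1 = (0, 1, "NP", ["I"])
v   = (1, 2, "VP", ["feel"])
np2 = (2, 3, "NP", ["it"])
vp  = (1, 3, "VP", [v, np2])
s   = (0, 3, "S",  [np1, vp])

print(build_tree(s))
# ('S', [('NP', ['I']), ('VP', [('VP', ['feel']), ('NP', ['it'])])])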

2.0 Efficient Parsing (Cont)

A Variant of the Nondeterministic Chart Parsing Algorithm (figure)

Keeps track of the entire parse tree

We can look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree

Current Topic: 3.0 Scaling up the Lexicon

3.0 Scaling Up the Lexicon

In real text-understanding systems, the input is a sequence of characters from which the words must be extracted

A four-step process for doing this consists of:

– tokenization

– morphological analysis

– dictionary lookup

– error recovery

Since many natural languages are fundamentally different, these steps would be much harder to apply to some languages than others

3.0 Scaling Up the Lexicon (Cont)

a) Tokenization
– the process of dividing the input into distinct tokens -- words and punctuation marks
– this is not easy in some languages, like Japanese, where there are no spaces between words

– this process is much easier in English although it is not trivial by any means

– examples of complications may include:
• a hyphen at the end of a line may be an interword or an intraword dash

– tokenization routines are designed to be fast, with the idea that as long as they are consistent in breaking up the input text into tokens, any problems can always be handled at some later stage of processing
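A minimal tokenization sketch in this spirit, assuming a single regular-expression rule; a real tokenizer needs many more rules (e.g. for the end-of-line hyphen case above).

import re

def tokenize(text):
    """Split the input into word tokens (keeping intraword hyphens) and punctuation."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(tokenize("The box is in the pen."))
# ['The', 'box', 'is', 'in', 'the', 'pen', '.']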

b) Morphological Analysis
– the process of describing a word in terms of the prefixes, suffixes and root forms that comprise it
– there are three ways that words can be composed:
• Inflectional Morphology
– reflects changes to a word that are needed in a particular grammatical context (Ex: most nouns take the suffix “s” when they are plural)
• Derivational Morphology
– derives a new word from another word that is usually of a different category (Ex: the noun “softness” is derived from the adjective “soft”)
• Compounding
– takes two words and puts them together (Ex: “bookkeeper” is a compound of “book” and “keeper”)
– used a lot in morphologically complex languages such as German, Finnish, Turkish, Inuit, and Yupik

3.0 Scaling Up the Lexicon (Cont)

c) Dictionary Lookup
– is performed on every token (except for special ones such as punctuation)
– the task is to find the word in the dictionary and return its definition
– two ways to do dictionary lookup:
• store morphologically complex words first:
– complex words are written to the dictionary and then looked up when needed
• do morphological analysis first:
– process the word before looking anything up
– Ex: “walked” -- strip off “ed” and look up “walk”
» if the verb is not marked as irregular, then “walked” would be the past tense of “walk”
– any implementation of the table abstract data type can serve as a dictionary: hash tables, binary trees, b-trees, and tries
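A sketch of the “do morphological analysis first” strategy: strip a known suffix and look the root up in the dictionary. The tiny dictionary and suffix table are assumptions for illustration only.

DICTIONARY = {"walk": "verb", "book": "noun", "soft": "adjective"}   # assumed
SUFFIXES = [("ed", "past tense of"), ("s", "plural of"), ("ness", "noun from")]

def lookup(word):
    """Try the word directly, then try stripping each known suffix."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    for suffix, gloss in SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in DICTIONARY:
            return f"{gloss} {root} ({DICTIONARY[root]})"
    return None                     # not found: hand off to error recovery

print(lookup("walked"))    # past tense of walk (verb)
print(lookup("softness"))  # noun from soft (adjective)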

3.0 Scaling Up the Lexicon (Cont)

d) Error Recovery
– is undertaken when a word is not found in the dictionary
– there are four types of error recovery:
• morphological rules can guess at the word’s syntactic class
– Ex: “smarply” is not in the dictionary but it is probably an adverb

• capitalization is a clue that a word is a proper name

• other specialized formats denote dates, times, social security numbers, etc

• spelling correction routines can be used to find a word in the dictionary that is close to the input word

– there are two popular models for defining “closeness” in words:

» Letter-Based Model

» Sound-Based Model

3.0 Scaling Up the Lexicon (Cont)

Letter-Based Model
– an error consists of inserting or deleting a single letter, transposing two adjacent letters, or replacing one letter with another
– Ex: a 10-letter word is one error away from 555 candidate strings:
• 10 deletions -- each of the ten letters could be deleted
• 9 swaps -- _x_x_x_x_x_x_x_x_x_ there are nine possible swaps, where “x” signifies that the “_” on its left and right could be switched
• 10 x 25 = 250 replacements -- each of the ten letters can be replaced by any of the other (26 - 1) letters of the alphabet
• 11 x 26 = 286 insertions -- x_x_x_x_x_x_x_x_x_x_x where each “x” can be one of the 26 letters of the alphabet
• total = 10 + 9 + 250 + 286 = 555
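A short sketch that enumerates the candidate edits for a word and reproduces the counts above; the function name and the restriction to lowercase letters are assumptions.

import string

def edit_counts(word):
    """Count deletions, adjacent transpositions, replacements and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions  = [a + b[1:] for a, b in splits if b]
    swaps      = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits if b for c in letters if c != b[0]]
    insertions = [a + c + b for a, b in splits for c in letters]
    return len(deletions), len(swaps), len(replaces), len(insertions)

counts = edit_counts("abcdefghij")     # any 10-letter word
print(counts, sum(counts))             # (10, 9, 250, 286) 555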

3.0 Scaling Up the Lexicon (Cont)

Sound-Based Model
– words are translated into a canonical form that preserves most of the information needed to pronounce the word, but abstracts away the details

– Ex: a word such as “attention” might be translated into the sequence [a, T, a, N, S, H, a, N], where “a” stands for any vowel

• this would mean that words such as “attension” and “atennshun” translate to the same sequence

• if no other word in the dictionary translates into the same sequence, then we can unambiguously correct the spelling error

• NOTE: letter-based approach would work just as well for “attention” but not for “atennshun”, which is 5 errors away from “attention”
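A deliberately crude sketch of such a canonical form; the three rewrite rules are ad-hoc assumptions chosen only so that the three spellings in the example map to the same sequence (a real system would use a proper phonetic code such as Soundex).

import re

def canonical(word):
    """Map a spelling to a rough sound-based canonical form."""
    w = word.lower()
    w = re.sub(r"(.)\1", r"\1", w)            # collapse doubled letters
    w = re.sub(r"[ts]i(?=[aeiou])", "sh", w)  # "ti"/"si" before a vowel sounds like "sh"
    w = re.sub(r"[aeiou]", "a", w)            # any vowel -> "a"
    return [c if c == "a" else c.upper() for c in w]

for spelling in ("attention", "attension", "atennshun"):
    print(spelling, canonical(spelling))
# all three yield ['a', 'T', 'a', 'N', 'S', 'H', 'a', 'N']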

3.0 Scaling Up the Lexicon (Cont)

Practical NLP systems have lexicons of from 10,000 to 100,000 root word forms
– building such a sizable lexicon is very time consuming and expensive
• this has been a cost that dictionary publishing companies and companies with NLP programs have not been willing to share

Wordnet is an exception to this rule:
– a freely available dictionary, developed by a group at Princeton (led by George Miller)
– the diagram on the next slide gives an example of the type of information returned by Wordnet about the word “ride”

3.0 Scaling Up the Lexicon (Cont)

Wordnet example of the word “ride” (figure)

Although dictionaries like Wordnet are useful, they do not provide all the lexical information one would like
– frequency information is missing

• some of the meanings are far more likely than others

• Ex: “pen” usually means a writing instrument although (very rarely) it can mean a female swan

– semantic restrictions are missing
• we need to know related information

• Ex: with the word “ride”, we may need to know whether we are talking about animals or vehicles because the actions in two cases are quite different


Current Topic: 4.0 List of References

4.0 List of References

http://nats-www.informatik.uni-hamburg.de/ Natural Language Systems

http://www.he.net/~hedden/intro_mt.html Machine Translation: A Brief Introduction

http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt Masaru Tomita

http://www.csli.stanford.edu/~aac/papers.html Ann Copestake's Online Publications

http://www.aventinus.de/ AVENTINUS advanced information system for multilingual drug enforcement

4.0 List of References (Cont)

http://ai10.bpa.arizona.edu/~ktolle/np.html AZ Noun Phraser

http://www.cam.sri.com/ Cambridge Computer Science Research Center

http://www-cgi.cam.sri.com/highlight/ Cambridge Computer Science Research Center, Highlight

http://www.cogs.susx.ac.uk/lab/nlp/ Natural Language Processing and Computational Linguistics at The University of Sussex

http://www.cogs.susx.ac.uk/lab/nlp/lexsys/ LexSys: Analysis of Naturally-Occurring English Text with Stochastic Lexicalized Grammars

4.0 List of References (Cont)

http://www.georgetown.edu/compling/parsinfo.htm Georgetown University: General Description of Parsers

http://www.georgetown.edu/compling/graminfo.htm Georgetown University: General Information about Grammars

http://www.georgetown.edu/cball/ling361/ling361_nlp1.html Georgetown University: Introduction to Computational Linguistics

http://www.georgetown.edu/compling/module.html Georgetown University: Modularity in Natural Language Parsing

4.0 List of References (Cont)

Elaine Rich and Kevin Knight, Artificial Intelligence

Patrick Henry Winston, Artificial Intelligence

Philip C. Jackson, Introduction to Artificial Intelligence