By Tim Adrian Gareau, Edward Dantsiguer
Posted on 22-Dec-2015
Agenda
1.0 Definitions
1.1 Characteristics of Successful Machines
1.2 Practical Applications
– 1.2.1 machine translation
– 1.2.2 database access
– 1.2.3 text interpretation
• 1.2.3.1 information retrieval
• 1.2.3.2 text categorization
• 1.2.3.3 extracting data from text
2.0 Efficient Parsing
3.0 Scaling up the Lexicon
4.0 List of References
Current Topic: 1.0 Definitions
1.0 Definitions
Natural languages are languages that living creatures use for communication
Artificial languages are mathematically defined classes of signals that can be used for communication with machines
A language is a set of sentences that may be used as signals to convey semantic information
The meaning of a sentence is the semantic information it conveys
Current Topic: 1.1 Characteristics of Successful Machines
1.1 Characteristics of Successful Natural Language Systems
Successful systems share two properties:
– they are focused on a particular domain rather than allowing discussion of any topic
– they are focused on a particular task rather than attempting to understand language completely
This means that a natural language system is more likely to work correctly if the set of possible inputs is restricted: the smaller the space of possible inputs, the greater the likelihood of success
Current Topic: 1.2 Practical Applications
1.2 Practical Applications
We are going to look at five practical applications of natural language processing:
– machine translation (1.2.1)
– database access (1.2.2)
– text interpretation (1.2.3)
• information retrieval (1.2.3.1)
• text categorization (1.2.3.2)
• extracting data from text (1.2.3.3)
1.2.1 Machine Translation
First suggestions were made by the Russian Smirnov-Troyansky and the Frenchman C.G. Artsouni in the 1930s
First serious discussions were begun in 1946 by mathematician Warren Weaver
– there was great hope that computers would be able to translate from one natural language to another (inspired by the success of the Allied efforts using the British Colossus computer)
• Turing's project "translated" coded messages into intelligible German
By 1954 there was a machine translation (MT) project at Georgetown University
– it succeeded in correctly translating several sentences from Russian into English
After the Georgetown project, MT projects were started at MIT, Harvard, and the University of Pennsylvania
1.2.1 Machine Translation (Cont)
It soon (1966) became apparent that translation is a very complicated task and that it would be practically impossible to account for all the intricacies and nuances of natural languages
– correct translation would require an in-depth understanding of both natural languages, since the structure of expressions varies from language to language
– Yehoshua Bar-Hillel declared that MT was impossible (the Bar-Hillel Paradox):
• human analysis of messages relies to some extent on information that is not present in the words that make up the message
– "The pen is in the box"
» [i.e. the writing instrument is in the container]
– "The box is in the pen"
» [i.e. the container is in the playpen or the pigpen]
1.2.1 Machine Translation (Cont)
There have been no fundamental breakthroughs in machine translation in the last 34 years
Progress has been made on restricted domains
– there are dozens of systems that are able to take a subset of one language and translate it fairly accurately into another language
– these systems work well enough to save significant sums of money over fully manual techniques (see examples two pages down)
Among these systems, those operating on more restricted domains produce more impressive results
Machine translation is NOT automatic speech recognition
1.2.1 Machine Translation (Cont)
Examples of poor machine translations include:
– "the spirit is strong, but the body is weak" was translated literally as "the vodka is strong but the meat is rotten"
– "out of sight, out of mind" was translated as "invisible, insane"
– "hydraulic ram" was translated as "male water sheep"
These do not imply that machine translation is a waste of time
– some mistakes are inevitable regardless of the quality and sophistication of the system
– one has to realize that human translators also make mistakes
1.2.1 Machine Translation (Cont)
Examples of machine translation systems include:
– TAUM-METEO system
• translates weather reports from English to French
• works very well since the language in government weather reports is highly stylized and regular
– SPANAM system
• translates Spanish into English
• worked on a more open domain
• results were reasonably good, although the resulting English text was not always grammatical and very rarely fluent
– AVENTINUS system
• advanced information system for multilingual drug enforcement
• allows law enforcement officials to know what a foreign document is about
• sorts, classifies, and analyzes drug-related information
1.2.1 Machine Translation (Cont)
There are three basic types of machine translation:
– Machine-assisted (aided) human translation (MAHT)
• the translation is performed by a human translator, but he/she uses a computer as a tool to improve or speed up the translation process
– Human-assisted (aided) machine translation (HAMT)
• the source language text is modified by a human translator before, during, or after it is translated by the computer
– Fully automatic machine translation (FAMT)
• the source language text is fed into the computer as a file, and the computer produces a translation automatically without any human intervention
1.2.1 Machine Translation (Cont)
Standing on its own, unrestricted machine translation (FAMT) is still inadequate
– human-assisted machine translation (HAMT) can be used to improve the quality of translation
• one possibility is to have a human reader go over the text after translation, correcting grammar errors (post-processing)
– the human reader can save a lot of time since some of the text will already be translated correctly
– sometimes a monolingual human can edit the output without reading the original
• another possibility is to have a human reader edit the document before translation (pre-processing)
– edit the original to conform to a restricted subset of a language
– this will usually allow the system to translate the resulting text without any need for post-editing
1.2.1 Machine Translation (Cont)
Restricted languages are sometimes called "Caterpillar English"
– Caterpillar was the first company to try writing their manuals using pre-processing
– Xerox was the first company to make really successful use of the pre-processing approach (SYSTRAN system)
• the language defined for their manuals was highly restricted, so translation into other languages worked quite well
There is a substantial start-up cost to any machine translation effort
– to achieve broad coverage, translation systems need lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism)
1.2.1 Machine Translation (Cont)
There are several basic theoretical approaches to machine translation:
– Direct MT Strategy
• based on good glossaries and morphological analysis
• always between a specific pair of languages
– Transfer MT Strategy
• first, the source language is parsed into an abstract internal representation
• a 'transfer' is then made into the corresponding structures of the target language
– Interlingua MT Strategy
• the idea is to create an artificial language
– it shares all the features and makes all the distinctions of all languages
– Knowledge-Based Strategy
• similar to the above
• the intermediate form is semantic in nature rather than syntactic
1.2.2 Database Access
The first major success of natural language processing
There was hope that databases could be queried in natural language instead of with complicated data retrieval commands
– this was a major problem in the early 1970s, since the staff in charge of data retrieval could not keep up with users' demand for data
The LUNAR system was the first such interface
– built by William Woods in 1973 for the NASA Manned Spacecraft Center
– the system was able to correctly answer 78% of questions such as: "What is the average modal plagioclase concentration for lunar samples that contain rubidium?"
1.2.2 Database Access (Cont)
Other examples of data retrieval systems include:
– CHAT system
• developed by Fernando Pereira in 1983
• similar level of complexity to the LUNAR system
• worked on geographical databases
• was restricted: question wording was very important
– TEAM system
• could handle a wider set of problems than CHAT
• was still restricted and unable to handle all types of input
1.2.2 Database Access (Cont)
Companies such as Natural Language Inc. and Symantec are still selling database tools that use natural language
The ability to control databases in natural language is not as great a concern as it was in the 1970s
– graphical user interfaces and the integration of spreadsheets, word processors, graphing utilities, report-generating utilities, etc. are of greater concern to database buyers today
– mathematical or set notation seems to be a more natural way of communicating with a database than plain English
– with the advent of SQL, the problem of data retrieval is not as major as it was in the past
1.2.3 Text Interpretation
In the early 1980s, most online information was stored in databases and spreadsheets
Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
– there is a need to sort this information to reduce it to some comprehensible amount
Text interpretation has become a major field in natural language processing
– becoming more and more important with the expansion of the Internet
– consists of:
• information retrieval
• text categorization
• data extraction
1.2.3.1 Information Retrieval
Information retrieval (IR) is also known as information extraction (IE)
Information retrieval systems analyze unrestricted text in order to extract specific types of information
IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
– relevance is determined by pre-defined domain guidelines, which must specify, as accurately as possible, exactly what types of information the system is expected to find
• a query is a good example of such a pre-defined domain
– documents that contain relevant information are retrieved while others are ignored
1.2.3.1 Information Retrieval (Cont)
Sometimes documents can be represented by a surrogate, such as the title plus a list of keywords and/or an abstract
It is more common to use the full text, possibly subdivided into sections that each serve as a separate document for retrieval purposes
The query is normally a list of words typed by the user
– Boolean combinations of words were used by earlier systems to construct queries
• users found it difficult to get good results from Boolean queries
• it was hard to find a combination of "AND"s and "OR"s that would produce appropriate results
The Boolean model has been replaced by the vector-space model in modern IR systems
– in the vector-space model, every list of words (both the documents and the query) is treated as a vector in n-dimensional vector space (where n is the number of distinct tokens in the document collection)
– one can use a "1" in a vector position if that word appears and a "0" if it does not
– vectors are then compared to determine which ones are close
– the vector model is more flexible than the Boolean model
• documents can be ranked and the closest matches reported first
1.2.3.1 Information Retrieval (Cont)
There are many variations on the vector-space model
– some allow stating that two words must appear near each other
– some use a thesaurus to automatically augment the words in the query with their synonyms
A good discriminator must be chosen in order for the system to be effective
– common words like "a" and "the" don't tell us much since they occur in just about every document
– a good way to set up the retrieval is to give a term a larger weight if it appears in a small number of documents
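The weighting idea above can be sketched as a small inverse-document-frequency computation; the toy documents and the logarithmic weighting formula are illustrative assumptions, not part of the slides:

```python
import math

def idf_weights(docs):
    """Give each term a weight that grows as the term appears in fewer documents."""
    n = len(docs)
    vocab = {term for doc in docs for term in doc}
    return {term: math.log(n / sum(term in doc for doc in docs)) for term in vocab}

# Three toy documents, each represented as a set of tokens.
docs = [{"the", "oil", "market"}, {"the", "exam"}, {"the", "oil", "pipeline"}]
weights = idf_weights(docs)
# "the" occurs in every document, so its weight is log(3/3) = 0;
# "exam" occurs in only one document, so it gets the largest weight.
```

A retrieval system would then score documents by summing (or taking the cosine of) these weights over the query terms, so rare discriminating terms dominate the ranking.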
1.2.3.1 Information Retrieval (Cont)
Another way to think about IR is in terms of databases. An IR system attempts to convert unstructured text documents into codified database entries. Database entries might be drawn from a set of fixed values, or they can be actual sub-strings pulled from the original source text.
From a language processing perspective, IR systems must operate at many levels, from word recognition to sentence analysis, and from understanding at the sentence level on up to discourse analysis at the level of full text document.
Dictionary coverage is an especially challenging problem since open-ended documents can be filled with all manner of jargon, abbreviations, and proper names, not to mention typos and telegraphic writing styles.
1.2.3.1 Information Retrieval (Cont)
Example (Vector-Space Model): assume we have one very short document containing the single sentence "CPSC 533 is the best Computer Science course at UofC", and that our query is "UofC"
– we need to set up our n-dimensional vector space: we have 10 distinct tokens (one for every word in the sentence)
– we set up the following vector to represent the sentence: (1,1,1,1,1,1,1,1,1,1), indicating that all ten words are present
– we set up the following vector for the query: (0,0,0,0,0,0,0,0,0,1), indicating that "UofC" is the only word present in the query
– by ANDing the two vectors together, we get (0,0,0,0,0,0,0,0,0,1), meaning that our document contains "UofC", as expected
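The worked example above can be reproduced in a few lines; a minimal sketch, with the vector positions simply following the order of the words in the sentence:

```python
doc_tokens = "CPSC 533 is the best Computer Science course at UofC".split()
vocab = doc_tokens                                   # 10 distinct tokens, in sentence order
doc_vec = [1] * len(vocab)                           # every vocabulary word occurs in the document
query_tokens = ["UofC"]
query_vec = [1 if word in query_tokens else 0 for word in vocab]
anded = [d & q for d, q in zip(doc_vec, query_vec)]  # Boolean AND, position by position
print(anded)  # -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]: the document contains "UofC"
```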
1.2.3.1 Information Retrieval (Cont)
Example: Commercial System (HIGHLIGHT):
– helps users find relevant information in large volumes of text and present it in a structured fashion
• it can extract information from newswire reports for a specific topic area - such as global banking, or the oil industry - as well as current and historical financial and other data
– although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most trained professional would have time to look for
– see demo at: http://www-cgi.cam.sri.com/highlight/
– could be classified under "Extracting Data From Text (1.2.3.3)"
1.2.3.2 Text Categorization
It is often desirable to sort all text into several categories
There are a number of companies that provide their subscribers access to all news on a particular industry, company, or geographic area
– traditionally, human experts were used to assign the categories
– in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)
The context in which text appears is very important, since the same word can be categorized completely differently depending on the context
– Example: in a dictionary, the primary definition of the word "crude" is vulgar, but in a large sample of the Wall Street Journal, "crude" refers to oil 100% of the time
1.2.3.3 Extracting Data From Text
The task of data extraction is to take online text and derive from it some assertions that can be put into a structured database
Examples of data extraction systems include:
– SCISOR system
• able to take stock information text (such as the type released by the Dow Jones News Service) and extract important stock information pertaining to:
– events that took place
– companies involved
– starting share prices
– quantity of shares that changed hands
– effect on stock prices
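Systems such as SCISOR rely on much deeper linguistic analysis, but the flavor of turning stock text into database fields can be sketched with simple pattern matching; the sentence, patterns, and field names here are invented for illustration:

```python
import re

story = "Acme Corp stock opened at $12.50, and 1,200,000 shares changed hands."

# Pull database-style fields out of the free text with regular expressions.
record = {
    "company": re.search(r"^([A-Z][\w.]*(?: [A-Z][\w.]*)*) stock", story).group(1),
    "opening_price": re.search(r"opened at \$([\d.]+)", story).group(1),
    "volume": re.search(r"([\d,]+) shares changed hands", story).group(1),
}
print(record)  # structured assertions ready for a database
```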
Current Topic: 2.0 Efficient Parsing
2.0 Efficient Parsing
Parsing -- the act of analyzing the grammaticality of an utterance according to some specific grammar
– the previous sentence was "parsed" according to some grammar of "English" and was determined to be grammatical
– we read the words in some order (from left to right, from right to left, or in random order) and analyzed them one by one
Each parse is a different method of analyzing some target sentence according to some specified grammar
2.0 Efficient Parsing (Cont)
Simple left-to-right parsing is often insufficient
– it is hard to determine the nature of the sentence
• this means that we have to make an initial guess as to what the sentence is saying
• this forces us to backtrack if the guess is incorrect
Some backtracking is inevitable
– to make parsing efficient, we want to minimize the amount of backtracking
• even if a wrong guess is made, we know that a portion of the sentence has already been analyzed -- there is no need to start from scratch since we can use the information that is already available to us
2.0 Efficient Parsing (Cont)
Example: we have two sentences:
– "Have students in section 2 of Computer Science 203 take the exam."
– "Have students in section 2 of Computer Science 203 taken the exam?"
• the first ten words, "Have students in section 2 of Computer Science 203", are exactly the same, although the meanings of the two sentences are completely different
• if an incorrect guess is made, we can still reuse the analysis of the first ten words when we backtrack
– this requires a lot less work
2.0 Efficient Parsing (Cont)
There are three main things we can do to improve efficiency:
– don't do twice what you can do once
– don't do once what you can avoid altogether
– don't represent distinctions that you don't need
To accomplish these we can use a data structure known as a chart (matrix) to store partial results
– this is a form of dynamic programming
– results are only calculated if they cannot be found in the chart
– only the portion of the calculation that cannot be found in the chart is computed, while the rest is retrieved from the chart
– algorithms that do this are called chart parsers
2.0 Efficient Parsing (Cont)
Examples of parsing techniques:
– Top-Down, Depth-First
– Top-Down, Breadth-First
– Bottom-Up, Depth-First Chart
– Prolog
– Feature Augmented Phrase Structure
These are not the only parsing techniques that exist
One is free to come up with his or her own algorithm for the order in which the individual words of a sentence will be analyzed
2.0 Efficient Parsing (Cont)
i) Top-Down, Depth-First
– uses a strategy of searching for phrasal constituents from the highest node (the sentence node) to the terminal nodes (the individual lexical items) to find a match to the possible syntactic structure of the input sentence
– stores attempts on a possibilities list as a stack data structure (LIFO)
ii) Top-Down, Breadth-First
– same searching strategy as Top-Down, Depth-First
– stores attempts on a possibilities list as a queue data structure (FIFO)
2.0 Efficient Parsing (Cont)
iii) Bottom-Up, Depth-First Chart
– the parse begins at the word level and uses the grammar rules to build higher-level structures ("bottom-up"), which are combined until a goal state is reached or until all the applicable grammar rules have been exhausted
iv) Prolog
– relies on the functionality of the Prolog programming language to generate a parse using the Top-Down, Depth-First algorithm
– naturally deals with constituents and their relationships
v) Feature Augmented Phrase Structure
– takes a sentence as input and parses it by accessing information in a featured phrase-structure grammar and lexicon
– parser output is a tree
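The bottom-up chart idea can be illustrated with a tiny CKY-style recognizer (a standard bottom-up chart technique, sketched here as an assumption rather than taken from the slides); the two-rule grammar mirrors the "I feel it" grammar used later in this section:

```python
# Tiny grammar: lexical categories plus two binary rules.
UNARY = {"i": {"NP"}, "it": {"NP"}, "feel": {"VP"}}
BINARY = {("NP", "VP"): "S", ("VP", "NP"): "VP"}

def cky(words):
    """Fill table[i][j] with every category spanning words i..j, smallest spans first."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] |= UNARY.get(word.lower(), set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point: reuse smaller stored results
                for left in table[i][k]:
                    for right in table[k][j]:
                        if (left, right) in BINARY:
                            table[i][j].add(BINARY[(left, right)])
    return table

table = cky("I feel it".split())
print("S" in table[0][3])  # -> True: the whole string is recognized as a sentence
```

Because every span's categories are stored in the table, no sub-analysis is ever computed twice, which is exactly the "don't do twice what you can do once" principle above.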
2.0 Efficient Parsing (Cont)
Chart parsing can be represented pictorially using a combination of n + 1 vertices and a number of edges
Notation for edge labels: [<Starting Vertex>, <Ending Vertex>, <Result> <Part 1> ... <Part n> • <Needed Part 1> ... <Needed Part k>]
– if the Needed Parts are added to the already available Parts, then Result is the outcome, spanning the edges from Starting Vertex to Ending Vertex
– see examples (two pages down)
If there are no Needed Parts (if k = 0), then the edge is called complete
– the edge is called incomplete otherwise
2.0 Efficient Parsing (Cont)
Chart-parsing algorithms use a combination of top-down and bottom-up processing
– this means that the parser never has to consider certain constituents that could not lead to a complete parse
– it also means that it can handle grammars with both left-recursive rules and rules with empty right-hand sides without going into an infinite loop
– the result of the algorithm is a packed forest of parse tree constituents rather than an enumeration of all possible trees
Chart parsing consists of forming a chart with n + 1 vertices and adding edges to the chart one at a time, trying to produce a complete edge that spans from vertex 0 to n and is of category S (sentence): [0, n, S NP VP •]
There is no backtracking -- everything that is put into the chart stays there
2.0 Efficient Parsing (Cont)
A) Edge [0,5, S NP VP •] -- says an NP followed by VP combine to make an S that spans the string from 0 to 5
B) Edge [0,2, S NP • VP] -- says that an NP spans the string from 0 to 2, and if we could find a VP to follow it, then we would have an S
2.0 Efficient Parsing (Cont)
There are four ways to add an edge to the chart:
– Initializer
• adds an edge to indicate that we are looking for the start symbol of the grammar, S, starting at position 0, but have not found anything yet
– Predictor
• takes an incomplete edge that is looking for an X and adds new incomplete edges that, if completed, would build an X in the right place
– Completer
• takes an incomplete edge that is looking for an X and ends at vertex j, and a complete edge that begins at j and has X as its left-hand side, and combines them to make a new edge where the X has been found
– Scanner
• similar to the completer, except that it uses the input words rather than existing complete edges to generate the X
2.0 Efficient Parsing (Cont)
Nondeterministic Chart Parsing Algorithm:
– treats the chart as a set of edges
– a new edge is nondeterministically added to the chart at every step (an edge is nondeterministically chosen from the possible additions)
– S is the start symbol and S' is a new nonterminal symbol
• we start out looking for S (i.e. we currently have an empty string)
– add edges using one of the three methods (predictor, completer, scanner), one at a time, until no new edges can be added
– at the end, if the required parse exists, it has been found
– if none of the methods can be used to add another edge to the set, the algorithm terminates
2.0 Efficient Parsing (Cont)
Using the sample chart on the previous page, the following steps are taken to complete the parse of "I feel it" -- page 1/3:
– 1. INITIALIZER: if we parse from vertex 0 to vertex 0 and look for S', we still need to find S -- (a)
– 2. PREDICTOR: we are looking for an incomplete edge that, if completed, would give us S -- we know that S consists of NP and VP, meaning that by going from 0 to 0 we will have S if we find NP and VP -- (b)
– 3. PREDICTOR: following a very similar rule, we know that we will have NP if we can find a Pronoun; this condition can be achieved by going from 0 to 0, looking for a Pronoun -- (c)
– 4. SCANNER: if we go from 0 to 1, parsing "I", we will have our NP since a Pronoun is found -- (d)
2.0 Efficient Parsing (Cont)
Example (continued) -- page 2/3:
– 5. COMPLETER: summarizing the steps above, we are looking for S, and by going from 0 to 1 we have NP and are still looking for VP -- (e)
– 6. PREDICTOR: we are now looking for VP, and by going from 1 to 1 we will have VP if we can find a Verb -- (f)
– 7. PREDICTOR: VP can also consist of another VP and an NP, meaning that 6 would also work if we can find VP and NP -- (g)
– 8. SCANNER: by going from 1 to 2 we can find a Verb, thus we can find VP -- (h)
– 9. COMPLETER: using 7 and 8, we know that since a VP is found, we can complete the larger VP by going from 1 to 2 and then finding an NP -- (i)
– 10. PREDICTOR: NP can be completed by going from 2 to 2 and finding a Pronoun -- (j)
2.0 Efficient Parsing (Cont)
Example (continued) -- page 3/3:
– 11. SCANNER: we can find a Pronoun if we go from 2 to 3, thus completing NP -- (k)
– 12. COMPLETER: using 7 - 11, we know that VP can be found by going from 1 to 3, thus finding VP and NP -- (l)
– 13. COMPLETER: using all of the information we have collected up to this point, we can get S by going from 0 to 3, thus finding the original NP and VP, where the VP consists of another VP and an NP -- (m)
All of these steps are summarized on the diagram on the next page
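The thirteen steps above can be run mechanically. The sketch below is an agenda-driven, Earley-style formulation of the algorithm (one common way to implement it, not the slides' exact code), with edges written as [start, end, lhs -> found • needed] over the same Pronoun/Verb grammar:

```python
from collections import namedtuple

# An edge [start, end, lhs -> found . needed]; empty `needed` means the edge is complete.
Edge = namedtuple("Edge", "start end lhs found needed")

GRAMMAR = {"S": [("NP", "VP")], "NP": [("Pronoun",)], "VP": [("Verb",), ("VP", "NP")]}
LEXICON = {"i": "Pronoun", "feel": "Verb", "it": "Pronoun"}

def chart_parse(words):
    n = len(words)
    chart, agenda = set(), [Edge(0, 0, "S'", (), ("S",))]        # INITIALIZER
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)                 # no backtracking: edges stay in the chart
        if edge.needed:
            x = edge.needed[0]
            for rhs in GRAMMAR.get(x, []):                       # PREDICTOR
                agenda.append(Edge(edge.end, edge.end, x, (), rhs))
            if edge.end < n and LEXICON.get(words[edge.end].lower()) == x:  # SCANNER
                agenda.append(edge._replace(end=edge.end + 1,
                                            found=edge.found + (x,),
                                            needed=edge.needed[1:]))
            for done in [e for e in chart if not e.needed]:      # COMPLETER (X already found)
                if done.lhs == x and done.start == edge.end:
                    agenda.append(edge._replace(end=done.end,
                                                found=edge.found + (x,),
                                                needed=edge.needed[1:]))
        else:                                                    # COMPLETER (X just completed)
            for waiting in [e for e in chart if e.needed]:
                if waiting.needed[0] == edge.lhs and waiting.end == edge.start:
                    agenda.append(waiting._replace(end=edge.end,
                                                   found=waiting.found + (edge.lhs,),
                                                   needed=waiting.needed[1:]))
    return chart

chart = chart_parse("I feel it".split())
print(any(e.lhs == "S" and not e.needed and (e.start, e.end) == (0, 3) for e in chart))
# -> True: a complete S spans vertices 0 to 3
```

Note that the left-recursive rule VP -> VP NP causes no infinite loop, because an edge already in the chart is never re-added.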
2.0 Efficient Parsing (Cont)
Left-Corner Parsing:
– avoids building some edges that could not possibly be part of an S spanning the whole string
– builds up a parse tree that starts with the grammar's start symbol and extends down to the last word in the sentence
– the Nondeterministic Chart Parsing Algorithm is an example of a left-corner parser
– using the example on the previous slide:
• "ride the horse" would never be considered as a VP
– this saves time since unrealistic combinations do not have to be first worked out and then discarded
2.0 Efficient Parsing (Cont)
Extracting Parses From the Chart: Packing
– when the chart parsing algorithm finishes, it returns an entire chart (a collection of parse tree constituents)
– what we really want is a parse tree (or several parse trees), e.g.:
• a) pick out parse trees that span the entire input
• b) pick out parse trees that for some reason do not span the entire input
– the easiest way to do this is to modify COMPLETER so that when it combines two child edges to produce a parent edge, it stores in the parent edge the list of children that comprise it
– when we are done with the parse, we only need to look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree
2.0 Efficient Parsing (Cont)
A Variant of the Nondeterministic Chart Parsing Algorithm
– keeps track of the entire parse tree
– we can look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree
Current Topic: 3.0 Scaling up the Lexicon
3.0 Scaling Up the Lexicon
In real text-understanding systems, the input is a sequence of characters from which the words must be extracted
A four-step process for doing this consists of:
– tokenization
– morphological analysis
– dictionary lookup
– error recovery
Since many natural languages are fundamentally different, these steps are much harder to apply to some languages than to others
3.0 Scaling Up the Lexicon (Cont)
a) Tokenization
– the process of dividing the input into distinct tokens -- words and punctuation marks
– this is not easy in some languages, like Japanese, where there are no spaces between words
– the process is much easier in English, although it is not trivial by any means
– examples of complications include:
• a hyphen at the end of a line may be an interword or an intraword dash
– tokenization routines are designed to be fast, with the idea that as long as they are consistent in breaking up the input text into tokens, any problems can be handled at some later stage of processing
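For English, a first-cut tokenizer can be written as a single regular expression; the pattern below (words with an optional internal apostrophe, digit runs, single punctuation marks) is one plausible choice among many, not the only correct one:

```python
import re

# Words (with an optional internal apostrophe), digit runs, or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Don't panic: CPSC 533 isn't trivial."))
# keeps contractions whole, splits off the colon and final period
```

A fast, consistent rule like this deliberately ignores the hard cases (such as line-final hyphens); per the slide, those are left to later processing stages.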
b) Morphological Analysis
– the process of describing a word in terms of the prefixes, suffixes, and root forms that comprise it
– there are three ways that words can be composed:
• Inflectional Morphology
– reflects the changes to a word that are needed in a particular grammatical context (Ex: most nouns take the suffix "s" when they are plural)
• Derivational Morphology
– derives a new word from another word that is usually of a different category (Ex: the noun "softness" is derived from the adjective "soft")
• Compounding
– takes two words and puts them together (Ex: "bookkeeper" is a compound of "book" and "keeper")
– used a lot in morphologically complex languages such as German, Finnish, Turkish, Inuit, and Yupik
3.0 Scaling Up the Lexicon (Cont)
c) Dictionary Lookup
– is performed on every token (except for special ones such as punctuation)
– the task is to find the word in the dictionary and return its definition
– two ways to do dictionary lookup:
• store morphologically complex words first:
– complex words are written to the dictionary and then looked up when needed
• do morphological analysis first:
– process the word before looking anything up
– Ex: "walked" -- strip off "ed" and look up "walk"
» if the verb is not marked as irregular, then "walked" is the past tense of "walk"
– any implementation of the table abstract data type can serve as a dictionary: hash tables, binary trees, tries, and b-trees
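The "morphological analysis first" ordering can be sketched as a suffix-stripping lookup; the tiny dictionary and suffix table here are placeholders for a real lexicon:

```python
# Placeholder lexicon: a real system would have tens of thousands of entries.
DICTIONARY = {"walk": {"category": "verb", "irregular": False}}
SUFFIXES = [("ed", "past tense"), ("ing", "present participle"), ("s", "plural / 3rd person")]

def lookup(token):
    """Try the token directly, then strip known suffixes and look up the root."""
    if token in DICTIONARY:
        return token, None
    for suffix, feature in SUFFIXES:
        root = token[: -len(suffix)]
        if token.endswith(suffix) and root in DICTIONARY and not DICTIONARY[root]["irregular"]:
            return root, feature
    return None  # not found: hand the token off to error recovery

print(lookup("walked"))  # -> ('walk', 'past tense')
```

The `irregular` flag implements the slide's caveat: "walked" only counts as the past tense of "walk" if "walk" is not marked as irregular.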
3.0 Scaling Up the Lexicon (Cont)
d) Error Recovery
– is undertaken when a word is not found in the dictionary
– there are four types of error recovery:
• morphological rules can guess at the word's syntactic class
– Ex: "smarply" is not in the dictionary, but it is probably an adverb
• capitalization is a clue that a word is a proper name
• other specialized formats denote dates, times, social security numbers, etc.
• spelling correction routines can be used to find a word in the dictionary that is close to the input word
– there are two popular models for defining "closeness" of words:
» Letter-Based Model
» Sound-Based Model
3.0 Scaling Up the Lexicon (Cont)
Letter-Based Model
– an error consists of inserting or deleting a single letter, transposing two adjacent letters, or replacing one letter with another
– Ex: a 10-letter word is one error away from 555 other candidate strings:
• 10 deletions -- each of the ten letters could be deleted
• 9 swaps -- _x_x_x_x_x_x_x_x_x_ there are nine possible swaps, where "x" signifies that the "_" on its left and right could be switched
• 10 x 25 replacements -- each of the ten letters can be replaced by any of the other (26 - 1) letters of the alphabet
• 11 x 26 insertions -- x_x_x_x_x_x_x_x_x_x_x and each "x" can be one of the 26 letters of the alphabet
• total = 10 + 9 + 250 + 286 = 555
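The counting argument can be checked by enumerating the single-edit operations directly; note that 10 x 25 is 250, so the per-category counts sum to 555 operations (some of the resulting strings may coincide, so the number of *distinct* neighbors can be slightly smaller):

```python
import string

def single_edit_operations(word):
    """All results of one deletion, adjacent swap, replacement, or insertion (with repeats)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters if c != b[0]]
    inserts = [a + c + b for a, b in splits for c in letters]
    return deletes + swaps + replaces + inserts

print(len(single_edit_operations("typewriter")))  # 10 + 9 + 250 + 286 = 555
```

A spelling corrector would intersect this candidate list with the dictionary to propose corrections.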
3.0 Scaling Up the Lexicon (Cont)
Sound-Based Model
– words are translated into a canonical form that preserves most of the information needed to pronounce the word, but abstracts away the details
– Ex: a word such as "attention" might be translated into the sequence [a, T, a, N, S, H, a, N], where "a" stands for any vowel
• this would mean that words such as "attension" and "atennshun" translate to the same sequence
• if no other word in the dictionary translates into the same sequence, then we can unambiguously correct the spelling error
• NOTE: the letter-based approach would work just as well for "attension" but not for "atennshun", which is 5 errors away from "attention"
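A toy canonicalizer in this spirit (not a real phonetic algorithm such as Soundex; the rewrite rules are invented just to cover this example) might look like:

```python
import re

def canonical(word):
    """Map a word to a rough 'sound' form: normalize the -tion/-sion ending,
    collapse doubled letters, and replace every vowel with the placeholder 'a'."""
    w = word.lower()
    w = w.replace("tion", "shun").replace("sion", "shun")
    w = re.sub(r"(.)\1+", r"\1", w)   # doubled letters sound like single ones
    w = re.sub(r"[aeiou]", "a", w)    # any vowel -> placeholder 'a'
    return w

print({canonical(w) for w in ["attention", "attension", "atennshun"]})
# all three collapse to one shared canonical form
```

If "attention" is the only dictionary word with this canonical form, both misspellings can be corrected unambiguously, including "atennshun", which the letter-based model cannot reach in one edit.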
3.0 Scaling Up the Lexicon (Cont)
Practical NLP systems have lexicons with from 10,000 to 100,000 root word forms
– building such a sizable lexicon is very time-consuming and expensive
• this has been a cost that dictionary publishing companies and companies with NLP programs have not been willing to share
WordNet is an exception to this rule:
– a freely available dictionary, developed by a group at Princeton (led by George Miller)
– the diagram on the next slide gives an example of the type of information returned by WordNet about the word "ride"
3.0 Scaling Up the Lexicon (Cont)
Although dictionaries like WordNet are useful, they do not provide all the lexical information one would like
– frequency information is missing
• some of the meanings are far more likely than others
• Ex: "pen" usually means a writing instrument, although (very rarely) it can mean a female swan
– semantic restrictions are missing
• we need to know related information
• Ex: with the word "ride", we may need to know whether we are talking about animals or vehicles, because the actions in the two cases are quite different
3.0 Scaling Up the Lexicon (Cont)
Current Topic: 4.0 List of References
4.0 List of References
http://nats-www.informatik.uni-hamburg.de/ Natural Language Systems
http://www.he.net/~hedden/intro_mt.html Machine Translation: A Brief Introduction
http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt Masaru Tomita
http://www.csli.stanford.edu/~aac/papers.html Ann Copestake's Online Publications
http://www.aventinus.de/ AVENTINUS advanced information system for multilingual drug enforcement
4.0 List of References (Cont)
http://ai10.bpa.arizona.edu/~ktolle/np.html AZ Noun Phraser
http://www.cam.sri.com/ Cambridge Computer Science Research Center
http://www-cgi.cam.sri.com/highlight/ Cambridge Computer Science Research Center, Highlight
http://www.cogs.susx.ac.uk/lab/nlp/ Natural Language Processing and Computational Linguistics at The University of Sussex
http://www.cogs.susx.ac.uk/lab/nlp/lexsys/ LexSys: Analysis of Naturally-Occurring English Text with Stochastic Lexicalized Grammars
4.0 List of References (Cont)
http://www.georgetown.edu/compling/parsinfo.htm Georgetown University: General Description of Parsers
http://www.georgetown.edu/compling/graminfo.htm Georgetown University: General Information about Grammars
http://www.georgetown.edu/cball/ling361/ling361_nlp1.html Georgetown University: Introduction to Computational Linguistics
http://www.georgetown.edu/compling/module.html Georgetown University: Modularity in Natural Language Parsing
4.0 List of References (Cont)
Elaine Rich and Kevin Knight, Artificial Intelligence
Patrick Henry Winston, Artificial Intelligence
Philip C. Jackson, Introduction to Artificial Intelligence