Text Data Mining
Post on 16-Aug-2015
INFORMATION EXTRACTION
What is Information Extraction?
Goal:
Extract structured information from unstructured (or loosely formatted) text.
Typical description of the task:
Identify named entities
Identify relations between entities
Populate a database
May also include:
Event extraction
Resolution of temporal expressions
Wrapper induction (automatic construction of templates)
Applications: natural language understanding, question answering, summarization, etc.
Information Extraction
IE extracts pieces of information that are salient to the user's needs:
Find named entities such as persons and organizations
Find attributes of those entities, or events they participate in
Contrast with IR, which only indicates which documents need to be read by a user
Links between the extracted information and the original documents are maintained to allow the user to reference context.
Schematic view of the Information Extraction Process
Information Extraction
Relevant IE Definitions
Entities:
Entities are the basic building blocks that can be found in text
documents.(An object of interest)
Examples: people, companies, locations, genes, and drugs.
Attributes:
Attributes are features of the extracted entities. (A property of an entity
such as its name, alias, descriptor or type)
Examples: the title of a person, the age of a person, and the type of an
organization.
Relevant IE Definitions
Facts: Facts are the relations that exist between entities. (a relationship held
between two or more entities such as the position of a person in a company)
Example: Employment relationship between a person and a company or phosphorylation between two proteins.
Events: An event is an activity or occurrence of interest in which entities
participate
An activity involving several entities, such as a terrorist act, airline crash, management change, new product introduction, a merger between two companies, a birthday, and so on.
IE - Method
Extract raw text (HTML, PDF, PS, GIF, etc.)
Tokenize
Detect term boundaries
e.g., "We extracted alpha 1 type XIII collagen from …", "Their house council recommended …"
Detect sentence boundaries
Tag parts of speech (POS)
e.g., John/noun saw/verb Mary/noun.
Tag named entities
e.g., person, place, organization, gene, chemical
Parse
Determine co-reference
Extract knowledge
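As a toy illustration of the first few steps, here is a minimal sketch in Python. The regex tokenizer, naive sentence splitter, and dictionary gazetteer are illustrative assumptions, not a production pipeline; a real system would use trained components.

```python
import re

# Toy gazetteer; a real system would use a trained NER model.
GAZETTEER = {"John": "person", "Mary": "person", "Microsoft": "organization"}

def split_sentences(text):
    # Naive sentence boundary detection: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Word-like units plus punctuation tokens
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag_entities(tokens):
    # Dictionary lookup stands in for the "tag named entities" step.
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

text = "John saw Mary. Microsoft revealed its earnings."
for sent in split_sentences(text):
    print(tag_entities(tokenize(sent)))
```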
Architecture: Components of IE Systems
Components of IE systems fall into two groups:
Core linguistic components, adapted to or useful for NLP tasks in general
IE-specific components, which address the core IE tasks
Components are also either domain-independent or domain-specific. The following steps are performed in the domain-independent part:
Meta-data analysis: extraction of the title, body, structure of the body (identification of paragraphs), and the date of the document.
Tokenization: segmentation of the text into word-like units, called tokens, and classification of their type, e.g., identification of capitalized words, words written in lowercase letters, hyphenated words, punctuation signs, numbers, etc.
Architecture: Components of IE Systems
Morphological analysis: extraction of morphological information from tokens which constitute potential word forms: the base form (or lemma), part of speech, and other morphological tags depending on the part of speech.
e.g., verbs have features such as tense, mood, aspect, person, etc. Words which are ambiguous with respect to certain morphological categories may undergo disambiguation; typically part-of-speech disambiguation is performed.
Sentence/utterance boundary detection: segmentation of text into a sequence of sentences or utterances, each of which is represented as a sequence of lexical items together with their features.
Common named-entity extraction: detection of domain-independent named entities, such as temporal expressions, numbers and currency, geographical references, etc.
Architecture: Components of IE Systems
Phrase recognition: recognition of small-scale, local structures such as noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations.
Syntactic analysis: computation of a dependency structure (parse tree) of the sentence based on the sequence of lexical items and small-scale structures.
Syntactic analysis may be deep or shallow. In the former case, all possible interpretations (parse trees) and grammatical relations within the sentence are computed. In the latter case, the analysis is restricted to identifying non-recursive structures, or structures with a limited amount of structural recursion, which can be identified with a high degree of certainty; linguistic phenomena which cause problems (ambiguities) are not handled, and are represented with underspecified structures.
Architecture: Components of IE Systems
The core IE tasks: NER, Co-reference resolution, and Detection of relations and events
Typically domain-specific, and are supported by domain-specific system components and resources.
Domain-specific processing is also supported on a lower level by detection of specialized terms in text.
Architecture: IE System
In the domain-specific core of the processing chain, a NER component is applied to identify the entities relevant in a given domain. Patterns may then be applied to:
Identify text fragments which describe the target relations and events, and
Extract the key attributes to fill the slots in the template representing the relation/event.
IE System - Architecture
Typical Architecture of an Information Extraction System
Architecture: Components of IE Systems
A co-reference component identifies mentions that refer to the same entity.
Partially-filled templates are fused and validated using domain-specific
inference rules in order to create full-fledged relation/event descriptions.
Several software packages provide various tools that can be used in the process of developing an IE system, ranging from core linguistic processing modules (e.g., language detectors, sentence splitters) to general IE-oriented NLP frameworks.
IE Task Types
Named Entity Recognition (NER)
Co-reference Resolution (CO)
Relation Extraction (RE)
Event Extraction (EE)
Named Entity Recognition
Named Entity Recognition (NER) addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Mohamad Gouse'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '15 January 1984'), numerical and currency expressions (e.g., '20 million euros'), etc.
The NER task may also include extracting descriptive information about the detected entities from the text, through filling of a small-scale template. For example, in the case of persons, it may include extracting the title, position, nationality, gender, and other attributes of the person.
NER also involves lemmatization (normalization) of the named entities, which is particularly crucial in highly inflective languages. For example, in Polish a name takes seven inflected forms depending on grammatical case: nominative, genitive, dative, accusative, instrumental, locative, and vocative.
Co-Reference
Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of the same entity in the text.
Entity mentions can be: (a) Named, in case an entity is referred to by name
e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity. (b) Pronominal, in case an entity is referred to with a pronoun
e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John. (c) Nominal, in case an entity is referred to with a nominal phrase
e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the definite noun phrase The company refers to Microsoft.
(d) Implicit, as in the case of zero-anaphora
e.g., in the Italian text fragment 'Berlusconi ha visitato il luogo del disastro. Ha sorvolato, con l'elicottero.' (Berlusconi has visited the place of the disaster. [He] flew over with a helicopter.), the second sentence does not have an explicit realization of the reference to Berlusconi.
Relation Extraction
Relation Extraction (RE) is the task of detecting and classifying predefined
relationships between entities identified in text.
For example:
EmployeeOf(Steve Jobs,Apple): a relation between a person and an
organisation, extracted from ‘Steve Jobs works for Apple’
LocatedIn(Smith,New York): a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in New York’,
SubsidiaryOf(TVN, ITI Holding): a relation between two companies, extracted from 'Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale.'
While the set of relations that may be of interest is unlimited, the set of relations within a given task is predefined and fixed, as part of the specification of the task.
Event Extraction
Event Extraction (EE) refers to the task of identifying events in free text and deriving detailed and structured information about them, ideally identifying who did what to whom, when, where, through what methods (instruments), and why.
Usually, event extraction involves extraction of several entities and relationships between them.
For instance, extraction of information on terrorist attacks from the text fragment 'Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.' involves identification of the perpetrators (masked gunmen), victims (people), number of killed/injured (at least 44), weapons and means used (rifles and grenades), and location (southeast Turkey).
Another example is the extraction of information on new joint ventures, where the aim is to identify the partners, products, profits and capitalization of the joint venture.
EE is considered to be the hardest of the four IE tasks.
IE Subtask: Named Entity Recognition
Detect and classify all proper names mentioned in text
What is a proper name? Depends on application.
People, places, organizations, times, amounts, etc.
Names of genes and proteins
Names of college courses
NER Example
Find the extent of each mention
Classify each mention
Sources of ambiguity:
Different strings that map to the same entity
Equivalent strings that map to different entities (e.g., U.S. Grant)
Approaches to NER
Early systems: hand-written rules
Statistical systems
Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)
Semi-supervised learning (bootstrapping)
Unsupervised learning (rely on lexical resources, lexical patterns, and
corpus statistics)
A Sequence-Labeling Approach using CRFs
Input: sequence of observations (tokens/words/text)
Output: sequence of states (labels/classes)
B: Begin, I: Inside, O: Outside
There is some evidence that also including L (Last) and U (Unit length) is advantageous (Ratinov and Roth, 2009)
CRFs define a conditional probability p(Y|X) over label sequences Y given an observation sequence X
No effort is wasted modeling the observations (in contrast to joint models like HMMs)
Arbitrary features of the observations may be captured by the model
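The B/I/O encoding can be decoded back into entity mentions (extent plus class). A small sketch; the tag names and the example sequence are made up for illustration:

```python
def bio_to_spans(tokens, tags):
    """Recover entity mentions (extent + class) from a B/I/O tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        if (tag == "O" or tag.startswith("B-")) and start is not None:
            spans.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag.split("-", 1)[1]
    return spans

tokens = ["U.S.", "Grant", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('U.S. Grant', 'PER'), ('New York', 'LOC')]
```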
Linear Chain CRFs
Simplest and most common graph structure, used for sequence modeling
Inference can be done efficiently using dynamic programming: O(|X||Y|²)
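That dynamic program is Viterbi decoding over the chain: for each position and each state, keep the best-scoring predecessor. The sketch below uses toy hand-set emission and transition scores, not a trained CRF; only the O(|X||Y|²) structure is the point.

```python
import math

def viterbi(obs, states, log_emit, log_trans):
    """Best-scoring state sequence for a linear chain: O(|X| * |Y|^2)."""
    V = [{s: log_emit(s, obs[0]) for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans(p, s))
            row[s] = V[-1][prev] + log_trans(prev, s) + log_emit(s, x)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores (assumptions): capitalized words look like mention tokens,
# and I may never follow O under the BIO scheme.
STATES = ["B", "I", "O"]

def log_emit(state, word):
    cap = word[0].isupper()
    if state == "O":
        return math.log(0.1 if cap else 0.8)
    return math.log(0.45 if cap else 0.1)

def log_trans(prev, cur):
    if prev == "O" and cur == "I":
        return float("-inf")  # illegal transition in BIO
    if prev == "B" and cur == "I":
        return math.log(0.6)
    return math.log(0.2)

print(viterbi(["John", "Smith", "slept"], STATES, log_emit, log_trans))  # ['B', 'I', 'O']
```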
Linear Chain CRFs
NER Features
Several feature families are used, all time-shifted by -2, -1, 0, 1, 2:
The word itself
Capitalization and digit patterns (shape patterns)
8 lexicons entered by hand (e.g., honorifics, days, months)
15 lexicons obtained from web sites (e.g., countries, publicly-traded
companies, surnames, stopwords, universities)
25 lexicons automatically induced from the web (people names,
organizations, NGOs, nationalities)
Limitations of Conventional NER(and IE)
Supervised learning
Expensive
Inconsistent
Worse for relations and events!
Fixed, narrow, pre-specified sets of entity types
Small, homogeneous corpora (newswire, seminar announcements)
Evaluating Named Entity Recognition
Recall that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled.
Precision is the ratio of the number of correctly labeled responses to the total labeled.
The F-measure provides a way to combine these two measures into a single metric.
recall = N_correct / N_key
precision = N_correct / (N_correct + N_incorrect)
F = ((β² + 1) · precision · recall) / (β² · precision + recall)
where N_key is the number of mentions in the gold-standard key, N_correct the number of correctly labeled responses, and N_incorrect the number of incorrectly labeled responses.
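The three measures can be computed directly; the counts below are hypothetical:

```python
def ner_scores(n_correct, n_incorrect, n_key, beta=1.0):
    """Precision, recall, and F-measure from labeling counts.
    n_key is the number of mentions in the gold-standard key."""
    recall = n_correct / n_key
    precision = n_correct / (n_correct + n_incorrect)
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Hypothetical run: 10 gold mentions, system labeled 10, got 8 right.
p, r, f = ner_scores(n_correct=8, n_incorrect=2, n_key=10)
print(p, r, f)
```

With β = 1 this reduces to the usual harmonic mean of precision and recall.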
What is Relation Extraction?
Typically defined as identifying relations between two entities
Relations, subtypes, and examples:
Affiliations
  Personal: married to, mother of
  Organizational: spokesman for, president of
  Artifactual: owns, invented, produces
Geospatial
  Proximity: near, on outskirts
  Directional: southeast of
Part-of
  Organizational: a unit of, parent of
  Political: annexed, acquired
Typical (Supervised) Approach
FindEntities(): named entity recognizer
Related?(): binary classifier that says whether two entities are involved in a relation
ClassifyRelation(): classifier that labels relations discovered by Related?()
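The three components compose into a pipeline. In this sketch all three are toy stand-ins (capitalization as NER, sentence co-occurrence as Related?, a lexical cue as the relation labeler), not trained models:

```python
from itertools import combinations

def find_entities(text):
    # toy NER: capitalized tokens are entity mentions
    return [t for t in text.replace(".", " ").split() if t[0].isupper()]

def related(e1, e2, text):
    # toy binary classifier: mentions in the same sentence are related
    return any(e1 in s and e2 in s for s in text.split("."))

def classify_relation(e1, e2, text):
    # toy relation labeler keyed on a lexical cue
    return "EmployeeOf" if " works for " in text else "Related"

text = "Steve works for Apple."
entities = find_entities(text)
extracted = [(a, classify_relation(a, b, text), b)
             for a, b in combinations(entities, 2) if related(a, b, text)]
print(extracted)  # [('Steve', 'EmployeeOf', 'Apple')]
```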
Typical (Semi-Supervised) Approach
NELL: Never-Ending Language Learner
NELL: Can computers learn to read?
Goal: create a system that learns to read the web
Reading task: Extract facts from text found on the web
Learning task: Iteratively improve reading competence.
http://rtw.ml.cmu.edu/rtw/
Approach
Inputs
Ontology with target categories and relations (i.e., predicates)
Small number of seed examples for each
Set of constraints that couple the predicates
Large corpus of unlabeled documents
Output: new predicate instances
Semi-supervised bootstrap learning methods
Couple the learning of functions to constrain the problem
Exploit redundancy of information on the web.
Coupled Semi-Supervised Learning
Types of Coupling
1. Mutual Exclusion (output constraint)
Mutually exclusive predicates can't both be satisfied by the same input x
E.g., x cannot be both a Person and a Sport
2. Relation Argument Type-Checking (compositional constraint)
Arguments of relations are declared to be of certain categories
E.g., CompanyIsInEconomicSector(Company, EconomicSector)
3. Unstructured and Semi-Structured Text Features (multi-view-agreement constraint)
Look at different views (as in co-training) and require that the classifiers agree
E.g., free-form textual contexts and semi-structured contexts
System Architecture
Coupled Pattern Learner (CPL)
Free-text extractor that learns contextual patterns to extract predicate instances
Uses mutual-exclusion and type-checking constraints to filter candidate instances
Rank instances and patterns by leveraging redundancy: if an instance or pattern occurs more frequently, it's ranked higher
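A minimal sketch of this pattern-learning and redundancy-ranking loop. The four-sentence corpus and seed set are made up, and real CPL operates over POS-tagged contexts with coupling constraints, which are omitted here:

```python
from collections import Counter

# Made-up mini-corpus and seed instances for a City category.
corpus = [
    "cities such as Paris are crowded",
    "cities such as Tokyo are crowded",
    "he visited Paris last year",
    "he visited Berlin last year",
]
seeds = {"Paris", "Tokyo"}

# Learn contextual patterns by replacing a seed occurrence with a slot.
patterns = set()
for sent in corpus:
    for seed in seeds:
        if seed in sent:
            patterns.add(sent.replace(seed, "<X>"))

# Apply the patterns to extract candidates; rank by redundancy:
# candidates matched by more pattern occurrences rank higher.
candidates = Counter()
for sent in corpus:
    for pat in patterns:
        prefix, _, suffix = pat.partition("<X>")
        if sent.startswith(prefix) and sent.endswith(suffix) and "<X>" not in sent:
            candidates[sent[len(prefix):len(sent) - len(suffix)]] += 1

print(candidates.most_common())  # 'Paris' matched twice; 'Tokyo', 'Berlin' once
```

Note that the loop also proposes the new instance 'Berlin', which was not among the seeds; iterating the loop is what makes the learner "never-ending".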
Coupled SEAL (CSEAL)
SEAL (Set Expander for Any Language) is a wrapper induction algorithm
Operates over semi-structured text such as web pages
Constructs page-specific extraction rules (wrappers) that are human- and
markup-language independent
CSEAL adds mutual-exclusion and type-checking constraints
CSEAL Wrappers
Seeds: Ford, Nissan, Toyota; in the learned wrappers, arg1 is a placeholder for extracting instances
Open IE and TextRunner
Motivations:
Web corpora are massive, introducing scalability concerns
Relations of interest are unanticipated, diverse, and abundant
Use of "heavy" linguistic technology (NER and parsers) does not work well
Input: a large, heterogeneous Web corpus (9M web pages, 133M sentences); no pre-specified set of relations
Output: a huge set of extracted relations (60.5M tuples, 11.3M high-probability tuples); tuples are indexed for searching
TextRunner Architecture
Learner: outputs a classifier that labels trustworthy extractions
Extractor: finds and outputs trustworthy extractions
Assessor: normalizes and scores the extractions
Architecture: Self-Supervised Learner
1. Automatically labels training data
Uses a parser to induce dependency structures
Parses a small corpus of several thousand sentences
Identifies and labels a set of positive and negative extractions using relation-independent heuristics
An extraction is a tuple t = (e_i, r_{i,j}, e_j)
Entities are base noun phrases
Uses the parse to identify potential relations
2. Trains a classifier
Domain-independent, simple non-parse features
E.g., POS tags, phrase chunks, regexes, stopwords, etc.
Architecture: Single-Pass Extractor
1. POS tag each word
2. Identify entities using lightweight NP chunker
3. Identify relations
4. Classify them
Architecture: Redundancy-Based Assessor
Take the tuples and perform
Normalization, deduplication, synonym resolution
Assessment
Number of distinct sentences from which each extraction was found serves
as a measure of confidence
Entities and relations indexed using Lucene
Template Filling
The task of template filling is to find documents that evoke a given situation, and then to fill the slots in the template with appropriate material.
These slot fillers may consist of
Text segments extracted directly from the text, or
Concepts that have been inferred from text elements via some
additional processing (times, amounts, entities from an ontology, etc.).
Applications of IE
Infrastructure for IR and for Categorization
Information Routing
Event Based Summarization
Automatic Creation of Databases
Company acquisitions
Sports scores
Terrorist activities
Job listings
Corporate titles and addresses
Inductive Algorithms for IE
Rule Induction algorithms produce symbolic IE rules based on a corpus of
annotated documents.
WHISK
BWI
The (LP)2 Algorithm
The inductive algorithms are suitable for semi-structured domains, where
the rules are fairly simple, whereas when dealing with free text documents
(such as news articles) the probabilistic algorithms perform much better.
WHISK
WHISK is a supervised learning algorithm that uses hand-tagged
examples for learning information extraction rules.
Works for structured, semi-structured, and free text.
Extracts both single-slot and multi-slot information.
Doesn't require syntactic preprocessing for structured and semi-structured text; a syntactic analyzer and semantic tagger are recommended for free text.
The extraction patterns learned by WHISK take the form of limited regular expressions, balancing expressiveness against efficiency.
Example: IE task of extracting neighborhood, number of bedrooms and
price from the text
WHISK
An Example from the Rental Ads domain
An example extraction pattern which can be learned by WHISK is,
*(Neighborhood) *(Bedroom) * ‘$’(Number)
Neighborhood, Bedroom, and Number – Semantic classes specified by domain experts.
WHISK learns the extraction rules using a top-down covering algorithm. The algorithm begins learning a single rule by starting with an empty rule; it then adds one term at a time until either no negative examples are covered by the rule or the pre-pruning criterion has been satisfied.
Terms are added to specialize the rule in order to reduce its Laplacian expected error, defined as
  Laplacian = (e + 1) / (n + 1)
where e is the number of negative extractions and n is the number of positive extractions the rule makes on the training instances.
Example:
For instance, from the text "3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc $995. (206)-999-9999", the rule would extract the frame Bedrooms – 3, Price – 995.
The "*" character in the pattern matches any number of characters (an unlimited jump).
Patterns enclosed in parentheses become numbered elements in the output pattern; hence (Digit) is $1 and (Number) is $2.
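WHISK's Laplacian expected error, (e + 1) / (n + 1), is trivial to compute; the point of the +1 smoothing is visible in a small comparison:

```python
def laplacian_error(e, n):
    """WHISK's Laplacian expected error for a rule: (e + 1) / (n + 1)."""
    return (e + 1) / (n + 1)

# A rule with no errors still has nonzero expected error, and the
# better-supported rule (n = 10) beats an equally clean narrow one (n = 2).
print(laplacian_error(0, 10))  # 0.0909... = 1/11
print(laplacian_error(0, 2))   # 0.3333... = 1/3
```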
Boosted Wrapper Induction (BWI)
BWI is a system that utilizes wrapper induction techniques for traditional Information Extraction.
IE is treated as a classification problem that entails approximating two boundary functions, Xbegin(i) and Xend(i).
Xbegin(i) is equal to 1 if the i-th token starts a field that is part of the frame to be extracted, and 0 otherwise. Xend(i) is defined in a similar way for tokens that end a field.
The learning algorithm approximates each X function by taking a set of pairs of the form (i, X(i)) as training data.
Each field is extracted by a wrapper W = <F, A, H> where
F is a set of begin boundary detectors
A is a set of end boundary detectors
H(k) is the probability that the field has length k
A boundary detector is just a sequence of tokens with wildcards (a kind of regular expression).
W(i, j) is a naïve Bayesian approximation of the probability that a field begins at i and ends at j.
The BWI algorithm learns the two detectors by using a greedy algorithm that extends the prefix and suffix patterns while there is an improvement in the accuracy.
The sets F(i) and A(i) are generated from the detectors by using the AdaBoost algorithm.
The detector pattern can include specific words and regular expressions that work on a set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation>, and <Alpha>.
W(i, j) = F(i) A(j) H(j − i + 1), and 0 otherwise (when no begin/end detectors fire),
where the detector scores are boosted sums over the learned detectors:
F(i) = Σk ck Fk(i)
A(i) = Σk ck Ak(i)
(LP)2 Algorithm
The (LP)2 algorithm learns from an annotated corpus and induces two sets of rules:
Tagging rules, generated by a bottom-up generalization process
Correction rules, which correct mistakes and omissions made by the tagging rules
A tagging rule is a pattern that contains conditions on the words preceding the place where a tag is to be inserted, and conditions on the words that follow the tag.
Conditions can be words, lemmas, lexical categories (such as digit, noun, verb, etc.), case (lower or upper), or semantic categories (such as time-id, cities, etc.).
The (LP)2 algorithm is a covering algorithm that tries to cover all training examples.
The initial tagging rules are generalized by dropping conditions.
IE and Text Summarization
From the user's perspective:
IE can be glossed as "I know what specific pieces of information I want; just find them for me!"
Summarization can be glossed as "What's in the text that is interesting?"
Technically, from the system builder's perspective, the two applications blend into each other. The most pertinent technical aspects are:
Are the criteria of interestingness specified at run-time or by the system builder?
Is the input a single document or multiple documents?
Is the extracted information manipulated, either by simple content delineation routines or by complex inferences, or just delivered verbatim?
What is the grain size of the extracted units of information: individual entities and events, or blocks of text?
Is the output formulated in language, or in a computer-internal knowledge representation?
Text Summarization
An information access technology that, given a document or a set of related documents, extracts the most important content from the source(s), taking into account the user or task at hand, and presents this content in a well-formed and concise text.
Text Summarization Techniques
Topic Representation
Influence of Context
Indicator Representations
Pattern Extraction
Text Summarization
Input: one or more text documents
Output: paragraph-length summary
Sentence extraction is the standard method:
Identify salient sentences within documents, using features such as keywords, sentence position in the document, and cue phrases
Extract and string the sentences together
Machine learning for extraction:
Use a corpus of document/summary pairs
Learn the features that best determine important sentences
E.g., summarization of scientific articles
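A frequency-based sentence extractor along these lines might look as follows; the stopword list, the word-frequency scoring, and the example document are simplifying assumptions (a real system would combine several features, such as position and cue phrases):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "was"}

def summarize(text, n=1):
    """Score each sentence by the summed document frequency of its content
    words, extract the top n, and string them together in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for w in re.findall(r"\w+", text.lower())
                   if w not in STOPWORDS)

    def score(sent):
        return sum(freq[w] for w in re.findall(r"\w+", sent.lower())
                   if w not in STOPWORDS)

    top = sorted(sorted(sentences, key=score, reverse=True)[:n],
                 key=sentences.index)
    return " ".join(top)

doc = ("The program crashed. The crash log shows a crash in the parser. "
       "Weather was nice.")
print(summarize(doc, n=1))  # picks the sentence densest in frequent content words
```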
A Summarization Machine
[Figure: the summarization machine takes one or more documents, plus an optional query, and produces an extract or an abstract. Outputs vary along several dimensions: indicative vs. informative, generic vs. query-oriented, background vs. just-the-news, and length (headline, very brief ~10%, brief ~50%, long ~100%). Internal representations include case frames, templates, core concepts, core events, relationships, clause fragments, and index terms.]
The Modules of the Summarization Machine
[Figure: the modules of the summarization machine. EXTRACTION produces extracts from the input document(s); INTERPRETATION maps extracts into internal representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms); GENERATION produces abstracts from those representations; FILTERING reduces multi-document extracts to single-document extracts.]
What is Summarization?
Data as input (database, software trace, expert system), text summary as output
Text as input (one or more articles), paragraph summary as output
Multimedia in input or output
Summaries must convey maximal information in minimal space
Typically involves three stages:
1. Content identification: find/extract the most important material
2. Conceptual organization
3. Realization
Types of summaries
Purpose: indicative, informative, and critical summaries
Form:
Extracts (representative paragraphs/sentences/phrases)
Abstracts: "a concise summary of the central subject matter of a document"
Dimensions: single-document vs. multi-document
Context: query-specific vs. query-independent
Generic vs. query-oriented: provides the author's view vs. reflects the user's interest
Genres
Headlines
Outlines
Minutes
Biographies
Abridgments
Sound bites
Movie summaries
Chronologies, etc.
Aspects that Describe Summaries
Input
  subject type: domain
  genre: newspaper articles, editorials, letters, reports, ...
  form: regular text structure, free-form
  source size: single doc, multiple docs (few, many)
Purpose
  situation: embedded in a larger system (MT, IR) or not?
  audience: focused or general
  usage: IR, sorting, skimming, ...
Output
  completeness: include all aspects, or focus on some?
  format: paragraph, table, etc.
  style: informative, indicative, aggregative, critical, ...
Single-Document Summarization: System Architecture
[Figure: a pipeline of extraction, sentence reduction, sentence combination, and generation. Input: a single document; extracted sentences are reduced and combined to produce the output summary. Supporting resources include a corpus, decomposition rules, a lexicon, a parser, and a co-reference component.]
Multi-Document Summarization
Monitor a variety of online information sources: news (possibly multilingual), email
Gather information on events across sources and time: same day, multiple sources; across time
Summarize: highlight similarities, new information, different perspectives, and user-specified interests, in real time
Example System: SUMMARIST
Three stages:
1. Topic Identification Modules: Positional Importance, Cue Phrases (under
construction), Word Counts, Discourse Structure (under construction), ...
2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept
Signatures (being extended)
3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent.
Planner & Realizer
SUMMARY = TOPIC ID + INTERPRETATION + GENERATION
From extract to abstract:
topic interpretation or concept fusion.
Experiment (Marcu, 1998): took 10 newspaper texts with human abstracts, and asked 14 judges to extract the corresponding clauses from the texts, to cover the same content.
Comparing word lengths of extracts to abstracts: extract_length ≈ 2.76 × abstract_length.
Topic Interpretation
Some Types of Interpretation
Concept generalization:
Sue ate apples, pears, and bananas → Sue ate fruit
Meronymy replacement:
Both wheels, the pedals, saddle, chain, ... → the bike
Script identification:
He sat down, read the menu, ordered, ate, paid, and left → He ate at the restaurant
Metonymy:
A spokesperson for the US Government announced that ... → Washington announced that ...
General Aspects of Interpretation
Interpretation occurs at the conceptual level...
...words alone are polysemous ("bat": animal vs. sports instrument) and combine for meaning ("alleged murderer" is not the same as "murderer").
For interpretation, you need world knowledge...
...the fusion inferences are not in the text!
Extract a pattern for each event in the training data, using part-of-speech and mention tags.
Example: "Japanese political leaders"
Text:     Japanese   political   leaders
Entities: GPE        -           PER
POS:      NN         JJ          NN
Pattern:  GPE JJ PER
Pattern Extraction
Summarization - Scope
Data preparation:
Collect large sets of texts with abstracts, all genres.
Build large corpora of <Text, Abstract, Extract> tuples.
Investigate relationships between extracts and abstracts (using <Extract, Abstract> tuples).
Types of summary:
Determine the characteristics of each type.
Topic identification:
Develop new identification methods (discourse, etc.).
Develop heuristics for method combination (train heuristics on <Text, Extract> tuples).
Summarization - Scope
Concept interpretation (fusion):
Investigate types of fusion (semantic, evaluative, ...).
Create large collections of fusion knowledge/rules (e.g., signature libraries, generalization and partonymic hierarchies, metonymy rules, ...).
Study incorporation of the user's knowledge in interpretation.
Generation:
Develop sentence-planner rules for dense packing of content into sentences (using <Extract, Abstract> pairs).
Evaluation:
Develop better evaluation metrics for each type of summary.
Apriori Algorithm
In computer science and data mining, Apriori is a classic algorithm for learning association rules.
Apriori is designed to operate on databases containing transactions, for example, collections of items bought by customers, or details of website visits.
The algorithm attempts to find subsets which are common to at least a minimum number C (the support threshold) of the itemsets.
Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash-tree structure to count candidate itemsets efficiently.
Find rules in two stages
Agrawal et al. divided the problem of finding good rules into two phases:
1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum.
2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.
Terminology
k-itemset : a set of k items. E.g.
{beer, cheese, eggs} is a 3-itemset
{cheese} is a 1-itemset
{honey, ice-cream} is a 2-itemset
support: an itemset has support s% if s% of the records in the DB contain that
itemset.
minimum support: the Apriori algorithm starts with the specification of a
minimum level of support, and will focus on itemsets with this level or
above.
Terminology
large itemset: doesn’t mean an itemset with many items. It means one
whose support is at least minimum support.
Lk : the set of all large k-itemsets in the DB.
Ck : a set of candidate large k-itemsets. The algorithm we will look at generates this set, which contains all the k-itemsets that might be large, and from it eventually generates Lk.
Terminology
sets: Let A be a set (A = {cat, dog}) and
let B be a set (B = {dog, eel, rat}) and
let C = {eel, rat}
I use ‘A + B’ to mean A union B.
So A + B = {cat, dog, eel, rat}
When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X.
E.g. B – C = {dog}
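This notation maps directly onto Python's set operators, which makes the terminology easy to check:

```python
A = {"cat", "dog"}
B = {"dog", "eel", "rat"}
C = {"eel", "rat"}

print(A | B)  # union, written 'A + B' above: {'cat', 'dog', 'eel', 'rat'}
print(B - C)  # set difference: {'dog'}

assert A | B == {"cat", "dog", "eel", "rat"}
assert B - C == {"dog"}
```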
Apriori Algorithm
Find all large 1-itemsets
For (k = 2; while L(k-1) is non-empty; k++)
{
  Ck = apriori-gen(L(k-1))
  For each c in Ck, initialise c.count to zero
  For all records r in the DB
    { Cr = subset(Ck, r); For each c in Cr, c.count++ }
  Set Lk := all c in Ck whose count >= minsup
} /* end -- return all of the Lk sets */
The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).
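The same loop can be written as a compact runnable sketch. This version uses a plain subset test per record instead of the hash-tree optimization, and the tiny three-transaction database is made up:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support count} for all large itemsets,
    via level-wise search."""
    transactions = [frozenset(t) for t in transactions]
    large = {}
    Ck = {frozenset([item]) for t in transactions for item in t}  # C1
    k = 1
    while Ck:
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        large.update(Lk)
        # apriori-gen: join Lk with itself to get (k+1)-item candidates...
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ...and prune any candidate with a k-subset that is not large
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        k += 1
    return large

db = [{"beer", "cheese"}, {"beer", "cheese", "eggs"}, {"cheese", "eggs"}]
large = apriori(db, minsup=2)
print(sorted((tuple(sorted(s)), n) for s, n in large.items()))
```

Note how the prune step uses the large-itemset property: {beer, cheese, eggs} is discarded without a database scan because its subset {beer, eggs} is not large.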
Example: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
Apriori Merits/Demerits
Merits
Uses large itemset property
Easily parallelized
Easy to implement
Demerits
Assumes transaction database is memory resident.
Requires many database scans.
Summary
Association rules form a widely applied data mining approach.
Association rules are derived from frequent itemsets.
The Apriori algorithm is an efficient algorithm for finding all frequent itemsets.
The Apriori algorithm implements level-wise search using the frequent-itemset property.
The Apriori algorithm can be further optimized.
There are many measures for association rules.
FP-Growth Algorithm
Frequent Pattern Mining: An Example
Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.
Input:
DB:
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}
Minimum support: ξ = 3
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem Statement: How to efficiently find all frequent patterns?
Overview of FP-Growth: Ideas
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
highly compacted, but complete for frequent pattern mining
avoids costly repeated database scans
Develop an efficient, FP-tree-based frequent pattern mining method (FP-growth)
a divide-and-conquer methodology: decompose mining tasks into smaller ones
avoids candidate generation: sub-database tests only
FP-tree: Construction and Design
Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent items (single item
patterns) and order them into a list L in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to the order in L;
Scan DB the second time, construct FP-tree by putting each frequency
ordered transaction onto it.
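The two steps above can be sketched as follows on the example DB. Note ties in frequency are broken alphabetically here; any fixed tie-break works, and the slides happen to list f before c:

```python
from collections import Counter

db = [['f','a','c','d','g','i','m','p'],
      ['a','b','c','f','l','m','o'],
      ['b','f','h','j','o'],
      ['b','c','k','s','p'],
      ['a','f','c','e','l','p','m','n']]
min_sup = 3

# Step 1: one scan counts the items; those meeting minimum support are
# ordered into L by descending frequency (alphabetical tie-break).
counts = Counter(i for t in db for i in t)
L = [i for i in sorted(counts, key=lambda x: (-counts[x], x))
     if counts[i] >= min_sup]

# Step 2: reorder each transaction's frequent items according to L.
rank = {item: r for r, item in enumerate(L)}
ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in db]
```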
FP-tree Example: step 1
Step 1: Scan DB for the first time to generate L (a by-product of the first scan of the database)

TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

L:
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
FP-tree Example: step 2
Step 2: scan the DB for the second time, order frequent items in each transaction

TID  Items bought                 (ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o}              {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
FP-tree Example: step 2
Step 2: construct FP-tree by inserting each ordered transaction as a path from the root

After inserting {f, c, a, m, p}:
  {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1
After inserting {f, c, a, b, m} (shares the prefix f, c, a, whose counts are incremented; a new branch b:1 -> m:1 is added under a):
  {} -> f:2 -> c:2 -> a:2 -> m:1 -> p:1
                           -> b:1 -> m:1
NOTE: Each transaction corresponds to one path in the FP-tree
FP-tree Example: step 2 (continued)
Step 2: construct FP-tree (continued)
After inserting {f, b}: f's count becomes 3 and a new branch b:1 is added directly under f.
After inserting {c, b, p}: a new branch c:1 -> b:1 -> p:1 is added under the root, since it shares no prefix with the existing paths.
After inserting {f, c, a, m, p}: the counts along the existing path f -> c -> a -> m -> p are incremented.
Node-links connect all nodes carrying the same item-name.
Construction Example
Final FP-tree, with a header table whose entries f, c, a, b, m, p each point to the first node carrying that item:
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
FP-Tree Definition
FP-tree is a frequent-pattern tree. Formally, an FP-tree is a tree structure defined as follows:
1. One root labeled as "null", a set of item-prefix sub-trees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix sub-trees has three fields:
   item-name: registers which item this node represents,
   count: the number of transactions represented by the portion of the path reaching this node,
   node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields: item-name, and head of node-link, which points to the first node in the FP-tree carrying the item-name.
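The three-field node definition above can be sketched as a small class plus a two-scan construction routine. The names (`FPNode`, `build_fptree`) are my own, and ties in frequency are broken alphabetically, so the tree shape may differ from the slides' figure while holding the same counts:

```python
from collections import Counter

class FPNode:
    """One FP-tree node with the fields from the definition above:
    item-name, count, and node-link, plus parent/children for traversal."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children, self.node_link = {}, None

def build_fptree(db, min_sup):
    """Two scans of db: find the frequent items ordered by descending count,
    then insert each frequency-ordered transaction as one path from the root."""
    counts = Counter(i for t in db for i in t)
    rank = {i: (-counts[i], i) for i in counts if counts[i] >= min_sup}
    root, header = FPNode(None, None), {}   # header: item -> head of node-links
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:               # new branch; hook into node-link chain
                child = FPNode(item, node)
                node.children[item] = child
                child.node_link, header[item] = header.get(item), child
            child.count += 1
            node = child
    return root, header

root, header = build_fptree([['f','a','c','d','g','i','m','p'],
                             ['a','b','c','f','l','m','o'],
                             ['b','f','h','j','o'],
                             ['b','c','k','s','p'],
                             ['a','f','c','e','l','p','m','n']], min_sup=3)
```

Following an item's node-link chain and summing the counts recovers that item's total support, which is exactly how the mining phase will use the header table.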
Advantages of the FP-tree Structure
The most significant advantage of the FP-tree
Scan the DB only twice and twice only.
Completeness:
The FP-tree contains all the information related to mining frequent patterns
(given the min-support threshold).
Compactness:
The size of the tree is bounded by the occurrences of frequent items
The height of the tree is bounded by the maximum number of items in a
transaction
FP-Growth: Mining Frequent Patterns Using FP-tree
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree: looking for shorter
ones recursively and then concatenating the suffix:
For each frequent item, construct its conditional pattern base, and then its
conditional FP-tree;
Repeat the process on each newly created conditional FP-tree until the
resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent
pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header table
Step 2:
Construct conditional FP-tree from each conditional pattern base
Step 3:
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path, simply
enumerate all the patterns
Step 1: Construct Conditional Pattern Base
Starting at the bottom of the frequent-item header table in the FP-tree:
Traverse the FP-tree by following the link of each frequent item
Accumulate all transformed prefix paths of that item to form a conditional pattern base

Conditional pattern bases:
Item  Conditional pattern base
p     fcam:2, cb:1
m     fca:2, fcab:1
b     fca:1, f:1, c:1
a     fc:3
c     f:3
f     { }
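As an illustrative sketch, the table above can be reproduced from the frequency-ordered transactions of step 2. This computes the prefix paths directly from the transactions rather than by walking node-links; since the FP-tree merely merges shared prefixes, the aggregated result is the same:

```python
from collections import Counter

# Frequency-ordered transactions of the running example (from step 2)
ordered = [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
           ['c','b','p'], ['f','c','a','m','p']]

def conditional_pattern_base(ordered_db, item):
    """Prefix paths preceding `item` in the ordered transactions, with counts."""
    base = Counter()
    for t in ordered_db:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:                 # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)
```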
Properties of FP-Tree
Node-link property
For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in
the FP-tree header.
Prefix path property
To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and each item on that sub-path carries the same count as node ai.
Step 2: Construct Conditional FP-tree
For each pattern base:
Accumulate the count for each item in the base
Construct the conditional FP-tree for the frequent items of the pattern base

Example: m's conditional pattern base is fca:2, fcab:1.
The accumulated counts are f:3, c:3, a:3, b:1. Since b falls below the minimum support of 3, the m-conditional FP-tree is the single path:
  {} -> f:3 -> c:3 -> a:3
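The count-accumulation step above can be sketched as a small helper (the function name is my own):

```python
from collections import Counter

def conditional_tree_items(pattern_base, min_sup):
    """Accumulate per-item counts over a conditional pattern base
    (prefix path -> count) and keep only the items meeting minimum support;
    these items form the conditional FP-tree."""
    counts = Counter()
    for path, n in pattern_base.items():
        for item in path:
            counts[item] += n
    return {i: c for i, c in counts.items() if c >= min_sup}

# m's conditional pattern base from the slides: fca:2, fcab:1
m_base = {('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1}
```

With a minimum support of 3 this keeps f, c and a (each at count 3) and drops b, matching the m-conditional FP-tree above.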
Step 3: Recursively mine the conditional FP-tree
The conditional FP-tree of "m" is the single path {} -> f:3 -> c:3 -> a:3, i.e., (fca:3).
Recursively adding one item of the path ("a", then "c", then "f") to the current suffix:
  conditional FP-tree of "am": (fc:3)
  conditional FP-tree of "cm": (f:3)
  conditional FP-tree of "fm": empty — frequent pattern fm:3
  conditional FP-tree of "cam": (f:3)
  conditional FP-tree of "fam": empty — frequent pattern fam:3
  conditional FP-tree of "fcm": empty — frequent pattern fcm:3
  conditional FP-tree of "fcam": empty — frequent pattern fcam:3
Each suffix produced along the way is a frequent pattern: m, am, cm, fm, cam, fam, fcm, fcam.
Principles of FP-Growth
Pattern growth property
Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
Is "fcabm" a frequent pattern?
"fcab" is a branch of m's conditional pattern base
"b" is NOT frequent in transactions containing "fcab"
So "bm" is NOT a frequent itemset, and neither is "fcabm".
Conditional Pattern Bases and Conditional FP-Trees
(listed in the order of L)
Item  Conditional pattern base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)}|c
a     {(fc:3)}                    {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}          {(c:3)}|p
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P.
The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.
Example: the m-conditional FP-tree is the single path {} -> f:3 -> c:3 -> a:3.
All frequent patterns concerning m are m combined with each subset of {f, c, a}:
m, fm, cm, am, fcm, fam, cam, fcam
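The enumeration above is a power-set construction, which can be sketched with `itertools` (the function name is my own):

```python
from itertools import chain, combinations

def single_path_patterns(path_items, suffix):
    """All frequent patterns from a single-path conditional FP-tree:
    every subset of the path's items (including the empty subset),
    concatenated with the suffix item."""
    subsets = chain.from_iterable(
        combinations(path_items, k) for k in range(len(path_items) + 1))
    return [s + (suffix,) for s in subsets]

patterns = single_path_patterns(('f', 'c', 'a'), 'm')
```

For the path f, c, a and suffix m this yields the 8 patterns listed above, from m alone up to fcam.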
Summary of FP-Growth Algorithm
Mining frequent patterns can be viewed as first mining 1-itemset and
progressively growing each 1-itemset by mining on its conditional pattern base
recursively
Transform a frequent k-itemset mining problem into a sequence of k frequent 1-
itemset mining problems via a set of conditional pattern bases
Efficiency Analysis
Facts: usually
1. FP-tree is much smaller than the size of the DB
2. Pattern base is smaller than original FP-tree
3. Conditional FP-tree is smaller than pattern base
The mining process works on a set of usually much smaller pattern bases and conditional FP-trees
Divide-and-conquer, with a dramatic shrinking of the data at each step
Performance Improvement
Projected DBs: partition the DB into a set of projected DBs, then construct and mine an FP-tree in each projected DB.
Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost.
FP-tree materialization: a low ξ may usually satisfy most of the mining queries in the FP-tree construction.
FP-tree incremental update: how to update an FP-tree when there is new data? Either reconstruct the FP-tree, or do not update it.