Probase : A Knowledge Base for Text Understanding

Post on 19-Dec-2015

TRANSCRIPT

Page 1: Probase : A Knowledge Base for Text Understanding

Probase : A Knowledge Base for Text Understanding

Page 2: Probase : A Knowledge Base for Text Understanding

Integrating, representing, and reasoning over human knowledge: the grand challenge of the 21st century.

Page 3: Probase : A Knowledge Base for Text Understanding

Knowledge Bases

Scope:

Cross domain / Encyclopedia: (Freebase, Cyc, etc.)

Vertical domain / Directory: (Entity Cube, etc.)

Challenges:

completeness (e.g., a complete list of restaurants)

accuracy (e.g., address, phone, menu of each restaurant)

data integration

Page 4: Probase : A Knowledge Base for Text Understanding

“Concepts are the glue that holds our mental world together.”

“By being able to place something into its proper place in the concept hierarchy, one can learn a considerable amount about it.”

Page 5: Probase : A Knowledge Base for Text Understanding

Yellow Pages or Knowledge Bases?

Existing Knowledge Bases are more like big Yellow Page Books

Goals:

Complete list of items

Complete information of each item

But humans do not need “complete” knowledge in order to understand the world

Page 6: Probase : A Knowledge Base for Text Understanding

What’s wrong with the yellow pages approach?

Rich in instances, poor in concepts

Understanding = forming concepts

Do ontologies always have an elegant hierarchy?

Manually crafted ontologies are for human use

Our mental world does not have an elegant hierarchy

Page 7: Probase : A Knowledge Base for Text Understanding

Probase: “It is better to use the world as its own model.”

Capture concepts (in our mental world)

Quantify uncertainty (for reasoning)

Page 8: Probase : A Knowledge Base for Text Understanding

Capture concepts in human mind

Represent them in a computable form

Transform them to machines

Machines have better understanding of human world

More than 2.7 million concepts automatically harnessed from 1.68 billion documents

Computation/Reasoning enabled by scoring:

Consensus: e.g., is there a company called Apple?

Typicality: e.g., how likely are you to think of Apple when you think about companies?

Ambiguity: e.g., does the word Apple, sans any context, represent Apple the company?

Similarity: e.g., how likely is an actor also a celebrity?

Freshness: e.g., Pluto as a dwarf planet is a fresher claim than Pluto as a planet.

…

Give machines a new CPU (Commonsense Processing Unit) powered by a distributed graph engine called Trinity.

A little knowledge goes a long way after machines acquire a human touch

Page 9: Probase : A Knowledge Base for Text Understanding

Probase: 2.7 M concepts, automatically harnessed

Freebase: 2 K concepts, built by community effort

Cyc: 120 K concepts, 25 years of human labor

Probase has a big concept space

Page 10: Probase : A Knowledge Base for Text Understanding

Probase has 2.7 million concepts (frequency distribution)

Page 11: Probase : A Knowledge Base for Text Understanding

Probase vs. Freebase

Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.

Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful.

Uncertainty

Page 12: Probase : A Knowledge Base for Text Understanding

Probase Internals

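[Figure: example of the concept hierarchy. artist > painter > Picasso (Movement: Cubism; Born: 1881; Died: 1973; …). art > painting > Guernica (Year: 1937; Type: Oil on Canvas; …). Guernica is linked to Picasso by a "created by" relation.]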

Page 13: Probase : A Knowledge Base for Text Understanding

Data Sources

Patterns for single statements

NP such as {NP, NP, ..., (and|or)} NP
such NP as {NP,}* {(or|and)} NP
NP {, NP}* {,} or other NP
NP {, NP}* {,} and other NP
NP {,} including {NP ,}* {or | and} NP
NP {,} especially {NP,}* {or|and} NP

Examples:

Good: “rich countries such as USA and Japan …”

Tough: “animals other than cats such as dogs …”

Hopeless: “At Berklee, I was playing with cats such as Jeff Berlin, Mike Stern, Bill Frisell, and Neil Stubenhaus.”
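
To make the "such as" pattern concrete, here is a minimal sketch in Python. It is not the extractor used to build Probase: it assumes noun phrases can be approximated by single words and comma/"and"/"or" splitting, and the names `SUCH_AS` and `extract_isa_pairs` are hypothetical.

```python
import re

# Hypothetical, simplified stand-in for "NP such as NP, NP, ... (and|or) NP".
# A real system would use a noun-phrase chunker; here the concept is just the
# single word before "such as".
SUCH_AS = re.compile(r"(?P<concept>\w+) such as (?P<items>[^.;]+)")


def extract_isa_pairs(sentence):
    """Return (concept, instance) pairs matched by the naive 'such as' pattern."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    concept = match.group("concept")
    # Split the enumeration on commas and on a joining "and"/"or".
    items = re.split(r",\s*(?:and |or )?|\s+(?:and|or)\s+", match.group("items"))
    return [(concept, item.strip()) for item in items if item.strip()]


good = extract_isa_pairs("rich countries such as USA and Japan")
tough = extract_isa_pairs("animals other than cats such as dogs")
```

On the "good" sentence this sketch pairs "countries" with "USA" and "Japan". On the "tough" sentence it attaches "dogs" to "cats" instead of "animals", which is the kind of error that makes purely syntactic matching insufficient.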

Page 14: Probase : A Knowledge Base for Text Understanding

Examples

Animals other than dogs such as cats …

Eating fruits such as apple … can be good for your health.

Eating disorders such as … can be bad for your health.

Page 15: Probase : A Knowledge Base for Text Understanding

Properties

Given a class, find its properties

Candidate seed properties:

“What is the [property] of [instance]?”

“Where”, “When”, “Who” are also considered
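
A rough sketch of how the question pattern could yield candidate properties for a class. The filtering by known instances and the counting step are assumptions made for illustration, and `QUESTION` and `candidate_properties` are hypothetical names, not part of Probase.

```python
import re
from collections import Counter

# Matches "What is the [property] of [instance]?" (case-insensitive).
QUESTION = re.compile(
    r"what is the (?P<prop>[\w ]+?) of (?P<inst>[\w ]+)\?", re.IGNORECASE
)


def candidate_properties(questions, instances_of_class):
    """Count properties asked about instances that belong to the given class."""
    counts = Counter()
    for question in questions:
        match = QUESTION.search(question)
        if match is None:
            continue
        instance = match.group("inst").strip().lower()
        if instance in instances_of_class:
            counts[match.group("prop").strip().lower()] += 1
    return counts


questions = [
    "What is the capital of China?",
    "What is the population of India?",
    "What is the capital of India?",
    "What is the meaning of life?",
]
seeds = candidate_properties(questions, {"china", "india"})
```

Here `seeds` counts "capital" twice and "population" once, while the last question is ignored because "life" is not a known instance of the class.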

Page 16: Probase : A Knowledge Base for Text Understanding

Similarity between two concepts

Weighted linear combinations of

Similarity between the set of instances

Similarity between the set of attributes

(nation, country)

(celebrity, well-known politicians)
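
The weighted combination can be sketched as follows. The slide does not say which set similarity is used, so Jaccard overlap and the weight `alpha = 0.5` are assumptions, and the toy instance and attribute sets are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def concept_similarity(c1, c2, alpha=0.5):
    """Weighted linear combination of instance-set and attribute-set similarity."""
    instance_sim = jaccard(c1["instances"], c2["instances"])
    attribute_sim = jaccard(c1["attributes"], c2["attributes"])
    return alpha * instance_sim + (1 - alpha) * attribute_sim


# Toy data (hypothetical, not taken from Probase).
nation = {"instances": {"china", "india", "japan"},
          "attributes": {"capital", "population"}}
country = {"instances": {"china", "india", "canada"},
           "attributes": {"capital", "population", "language"}}
score = concept_similarity(nation, country)
```

With these toy sets the instance overlap is 2/4 and the attribute overlap is 2/3, so the combined score is their average.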

Page 17: Probase : A Knowledge Base for Text Understanding

Beyond noun phrases

Example: the verb “hit”

Small object, Hard surface: (bullet, concrete), (ball, wall)

Natural disaster, Area: (earthquake, Seattle), (Hurricane Floyd, Florida)

Emergency, Country: (economic crisis, Mexico), (flood, Britain)

Page 18: Probase : A Knowledge Base for Text Understanding

Quantify Uncertainty

Typicality

P(concept | instance)

P(instance | concept)

P(concept | property)

P(property | concept)

Similarity

sim(concept1, concept2)

the foundation of text understanding and reasoning

Page 19: Probase : A Knowledge Base for Text Understanding

Text Mining / IE: State of the Art

Bag of words based approach: e.g., LDA

Based on multiple document statistics

Simple bag-of-words, no semantics

Supervised learning: e.g., CRF

Labeled training data required

Difficulty for out-of-sample features

Lack of semantics

What role can a knowledgebase play?

Page 20: Probase : A Knowledge Base for Text Understanding

Shopping at Bellevue during TechFest

Five of us bought 5 Kinects and posed in front of an Apple store.

Apple store that sells fruits or apple store that sells iPads?

Page 21: Probase : A Knowledge Base for Text Understanding

Step by Step Understanding

Entity abstraction

Attribute abstraction

Short text/query (1-5 words) understanding

Text block/document understanding

Page 22: Probase : A Knowledge Base for Text Understanding

Explicit Semantic Analysis (ESA)

Goal: latent topics => explicit topics

Approach: mapping text to Wikipedia articles

An inverted list that records words’ occurrences in wiki articles

Given a document, we derive a distribution over wiki articles
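
The two steps can be sketched as below. This is a simplification: ESA as published weights words by TF-IDF, whereas this sketch uses raw occurrence counts, and the two "articles" are invented toy data. The function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(articles):
    """Map each word to {article title: number of occurrences}."""
    index = defaultdict(dict)
    for title, text in articles.items():
        for word in text.lower().split():
            index[word][title] = index[word].get(title, 0) + 1
    return index


def article_distribution(document, index):
    """Turn a bag of words into a normalized distribution over articles."""
    scores = defaultdict(float)
    for word in document.lower().split():
        for title, count in index.get(word, {}).items():
            scores[title] += count
    total = sum(scores.values())
    return {title: s / total for title, s in scores.items()} if total else {}


# Toy corpus (hypothetical).
articles = {"Bank": "bank money loan bank", "River": "river water bank"}
index = build_inverted_index(articles)
dist = article_distribution("bank loan", index)
```

For the query "bank loan", the "Bank" article receives weight 0.75 and "River" 0.25, so the output is a bag of articles and not a single concept.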

Page 23: Probase : A Knowledge Base for Text Understanding

Explicit Semantic Analysis (ESA)

bag of words => bag of Wikipedia articles (a distribution over Wikipedia articles)

A bag of Wikipedia articles is not equivalent to a clear concept in our mental world.

Top 10 concepts for “Bank of America”

Bank, Bank of America, Bank of America Plaza (Atlanta), Bank of America Plaza (Dallas), MBNA, VISA (credit card), Bank of America Tower New York City, NASDAQ, MasterCard, Bank of America Corporate Center

Page 24: Probase : A Knowledge Base for Text Understanding

Short Text

Challenge:

Not enough statistics

Applications

Twitter

Query/Search Log

Anchor Text

Image/video tag

Document paraphrasing and annotation

Page 25: Probase : A Knowledge Base for Text Understanding

Comparison of Knowledge Bases (WordNet, Wikipedia, Freebase, Probase)

Cat

WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...

Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758

Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...

Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...

IBM

WordNet: N/A

Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...

Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...

Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...

Language

WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter

Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art

Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...

Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...

Page 26: Probase : A Knowledge Base for Text Understanding

When the machine sees the word ‘apple’

[Figure: bar chart of the concepts associated with ‘apple’: company, brand, manufacturer, vendor, client, competitor, rival, corporation, technology company, industry leader, product, firm, retailer, player, organization, store, topic, computer, customer, platform, operating system. Axis label: concepts; value axis from 0 to 6000.]

Page 27: Probase : A Knowledge Base for Text Understanding

When the machine sees ‘apple’ and ‘pear’ together

Page 28: Probase : A Knowledge Base for Text Understanding

When the machine sees ‘China’ and ‘Israel’ together

Page 29: Probase : A Knowledge Base for Text Understanding

What is China that Israel is not?

Page 30: Probase : A Knowledge Base for Text Understanding

What is Israel that China is not?

Page 31: Probase : A Knowledge Base for Text Understanding

What’s the difference from a text cloud?

Page 32: Probase : A Knowledge Base for Text Understanding
Page 33: Probase : A Knowledge Base for Text Understanding

When a machine sees attributes …

website president city motto state type director

[Figure: # of Concepts (0 to 600) plotted against # of Attributes (1 to 7).]

Page 34: Probase : A Knowledge Base for Text Understanding

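Entity Abstraction

Given a set of entities $E = \{e_1, \dots, e_M\}$, the Naïve Bayes rule gives

$$P(c_k \mid E) = \frac{P(E \mid c_k)\,P(c_k)}{P(E)} \propto P(c_k) \prod_{i=1}^{M} P(e_i \mid c_k)$$

where $c_k$ is a concept, and $P(e_i \mid c_k)$ is computed based on the concept-entity co-occurrence.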
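
A minimal sketch of Naïve Bayes entity abstraction, scoring each concept by its prior times the product of P(entity | concept). The counts are copied from the example table later in this deck. Normalizing the prior over just these three concepts and the name `rank_concepts` are assumptions made to keep the example self-contained.

```python
from math import prod  # available from Python 3.8

# n(concept, entity): co-occurrence counts from the deck's example table.
CO_OCCURRENCE = {
    ("country", "india"): 80905,
    ("country", "china"): 98517,
    ("emerging market", "china"): 6556,
    ("emerging market", "india"): 5702,
    ("area", "china"): 2231,
    ("area", "india"): 1797,
}
# n(concept): the "Concept Number" column of the same table.
CONCEPT_COUNT = {"country": 2262485, "emerging market": 29298, "area": 2525020}


def rank_concepts(entities):
    """Rank concepts by P(c) * prod_i P(e_i | c), most likely first."""
    total = sum(CONCEPT_COUNT.values())  # assumption: only these concepts exist
    scores = {}
    for concept, n_c in CONCEPT_COUNT.items():
        prior = n_c / total
        likelihood = prod(
            CO_OCCURRENCE.get((concept, e), 0) / n_c for e in entities
        )
        scores[concept] = prior * likelihood
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_concepts(["china", "india"])
```

With only the two entities as evidence, "country" comes out ahead of "emerging market", and "area" is far behind because neither entity is typical of it.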

Page 35: Probase : A Knowledge Base for Text Understanding

How to Infer Concept from Attribute?

(university, florida state university, 75)
(university, harvard university, 388)
(university, university of california, 142)
(country, china, 97346)
(country, the united states, 91083)
(country, india, 80351)
(country, canada, 74481)

(florida state university, website, 34)
(harvard university, website, 38)
(university of california, city, 12)
(china, capital, 43)
(the united states, capital, 32)
(india, population, 35)
(canada, population, 21)

(university, website, 4568)
(university, city, 2343)
(country, capital, 4345)
(country, population, 3234)
……

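Given a set of attributes $A = \{a_1, \dots, a_N\}$, the Naïve Bayes rule gives

$$P(c_k \mid A) \propto P(c_k) \prod_{j=1}^{N} P(a_j \mid c_k)$$

where $P(a_j \mid c_k)$ is estimated from the concept-attribute co-occurrence counts.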

Page 36: Probase : A Knowledge Base for Text Understanding

Examples

Concept          Entity  Co-occurrence  Concept Number  Entity Number  P(e|c)   P(c|e)
country          india   80905          2262485         197915         0.03576  0.40879
country          china   98517          2262485         269127         0.04354  0.36606
emerging market  china   6556           29298           269127         0.22377  0.02436
emerging market  india   5702           29298           197915         0.19462  0.02881
area             china   2231           2525020         269127         0.00088  0.00829
area             india   1797           2525020         197915         0.00071  0.00908

Concept          Attribute   P(c, a)   P(c)       P(a)         P(a|c)   P(c|a)
country          population  4.08183   173.44931  41736.78060  0.02353  0.00010
country          language    1.48795   173.44931  58584.50905  0.00858  0.00003
emerging market  language    4.52949   402.13772  58584.50905  0.01126  0.00008
emerging market  population  16.54701  402.13772  41736.78060  0.04115  0.00040

• Concepts related to entities - “china” and “india”

• Concepts related to attributes – “language” and “population”
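
The two typicality columns of the entity table can be reproduced from the raw counts as simple ratios. The sketch below assumes P(e|c) = co-occurrence / concept number and P(c|e) = co-occurrence / entity number; the function name `typicality` is hypothetical.

```python
def typicality(co_occurrence, concept_count, entity_count):
    """Return (P(e|c), P(c|e)) estimated from co-occurrence counts."""
    p_e_given_c = co_occurrence / concept_count
    p_c_given_e = co_occurrence / entity_count
    return p_e_given_c, p_c_given_e


# Row "country / india" of the table: 80905, 2262485, 197915.
p_e_c, p_c_e = typicality(80905, 2262485, 197915)
rounded = (round(p_e_c, 5), round(p_c_e, 5))
```

Rounded to five decimals, the result matches the table row (0.03576 and 0.40879), and the same ratios reproduce the "country / china" and "emerging market / china" rows.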

Page 37: Probase : A Knowledge Base for Text Understanding

When Type of Term is Unknown:

Given a set of terms with unknown types

Generative model

Using the Naïve Bayes rule gives:

Discriminative model (Noisy-OR)

And applying Bayes' rule twice gives:

where indicate “entity” and indicate “attribute”

Page 38: Probase : A Knowledge Base for Text Understanding

Examples

Given “china”, “india”, “language” and “population”, “emerging market” will be ranked as 1st

Concept          Entity  Co-occurrence  Concept Number  Entity Number  P(e|c)   P(c|e)
country          india   80905          2262485         197915         0.03576  0.40879
country          china   98517          2262485         269127         0.04354  0.36606
emerging market  china   6556           29298           269127         0.22377  0.02436
emerging market  india   5702           29298           197915         0.19462  0.02881
area             china   2231           2525020         269127         0.00088  0.00829
area             india   1797           2525020         197915         0.00071  0.00908

Concept          Attribute   P(c, a)    P(c)         P(a)         P(a|c)   P(c|a)
factor           population  75.74704   71073.46656  41736.78060  0.00107  0.00181
factor           language    113.32628  71073.46656  58584.50905  0.00159  0.00193
countries        population  4.08183    173.44931    41736.78060  0.02353  0.00010
countries        language    1.48795    173.44931    58584.50905  0.00858  0.00003
emerging market  language    4.52949    402.13772    58584.50905  0.01126  0.00008
emerging market  population  16.54701   402.13772    41736.78060  0.04115  0.00040

Page 39: Probase : A Knowledge Base for Text Understanding

Example (Cont’d)

Page 40: Probase : A Knowledge Base for Text Understanding

Clustering Twitter Messages

Problem 1 (unique concepts): use keywords to retrieve tweets in 3 categories:

1. Microsoft, Yahoo, Google, IBM, Facebook
2. cat, dog, fish, pet, bird
3. Brazil, China, Russia, India

Problem 2 (concepts with subtle differences): use keywords to retrieve tweets in 4 categories:

1. United states, American, Canada
2. Malaysia, China, Singapore, India, Thailand, Korea
3. Angola, Egypt, Sudan, Zambia, Chad, Gambia, Congo
4. Belgium, Finland, France, Germany, Greece, Spain, Switzerland

Page 41: Probase : A Knowledge Base for Text Understanding

Comparison Results

Page 42: Probase : A Knowledge Base for Text Understanding

Other Possible Applications

Query Expansion

Information retrieval

Content based advertisement

Short Text Classification/Clustering

Twitter analysis/search

Text block tagging; image/video surrounding text summarization

Document Classification/Clustering/Summarization

News analysis

Out-of-sample example classification/clustering

Page 43: Probase : A Knowledge Base for Text Understanding

© 2011 Microsoft Corporation. All rights reserved. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.