Page 1: Mark Sanderson, University of Sheffield University of Sheffield CIIR, University of Massachusetts Deriving concept hierarchies from text Mark Sanderson,

Mark Sanderson, University of Sheffield

University of Sheffield
CIIR, University of Massachusetts

Deriving concept hierarchies from text

Mark Sanderson, Bruce Croft

The question is...

• What paper already presented at this SIGIR is most like the one you’re about to see?
• We’ll have the answer, right after this!

Concept hierarchies from documents?

• Hierarchy of concepts, Yahoo
  - General down to specific
  - Child under one or more parents
• No training data
• Why?
  - Understandable

Current methods

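• Polythetic clustering

[Table: term-by-document matrix; the column positions of the X marks are not recoverable]
Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4: X X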

An alternative?

• Monothetic clustering
  - Clusters based on a single feature
  - More ‘Yahoo/Dewey decimal’ like?
  - Easier to understand?
    » Preferable to users?
• What about hierarchies of clusters?

[Table: term-by-document matrix; the column positions of the X marks are not recoverable]
Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4:
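
A minimal sketch of monothetic clustering in Python: each cluster is defined by a single term, and a document belongs to every cluster whose term it contains. The incidence data below is hypothetical, because the X positions in the slide's table cannot be recovered, and the function name `monothetic_clusters` is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical term sets per document (NOT the values from the slide's table).
docs = {
    "D1": {"battery", "california", "technology", "mile"},
    "D2": {"battery", "california", "technology", "mile", "state"},
    "D3": {"battery", "technology", "state"},
    "D4": {"california", "state"},
}


def monothetic_clusters(docs):
    """Return one cluster per term: the set of documents containing it."""
    clusters = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            clusters[term].add(doc_id)
    return dict(clusters)


clusters = monothetic_clusters(docs)
for term in sorted(clusters):
    print(term, sorted(clusters[term]))
```

Because membership depends on one feature only, each cluster can be labelled by its term, which is what makes the result read more like a Yahoo or Dewey decimal category.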

How to arrange cluster terms?

• Existing techniques
  - WordNet
    » earthquake, volcano (eruption?)
  - Key phrases (Hearst 1998)
    » “such as”, “especially”
  - Phrase classification (Grefenstette 1997)
    » NP head or modifier: “types of research” from “research things”
  - Hierarchical phrase analysis (Woods 1997)
    » Head modifier again: “car washing” under “washing”, not “car”
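
As a rough illustration of the key-phrase idea, the Python sketch below pulls (general, specific) word pairs out of text using the cue “such as”. It is deliberately crude: it matches single words only, does no noun-phrase parsing, and is not the method of the cited work. The name `such_as_pairs` and the example sentence are assumptions.

```python
import re

# Single word, optional comma, the cue phrase, then a single word.
PATTERN = re.compile(r"(\w+),? such as (\w+)")


def such_as_pairs(text):
    """Return (general, specific) word pairs around the cue 'such as'."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in PATTERN.finditer(text)]


pairs = such_as_pairs("Natural disasters such as earthquakes are hard to predict.")
print(pairs)
```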

WordNet (aside)

• 1 sense of earthquake, sense 1
  - earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
    » geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
    » natural phenomenon, nature -- (all non-artificial phenomena)
    » phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)

WordNet (aside)

• 5 senses of eruption, sense 1
  - volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
    » discharge -- (the sudden giving off of energy)
    » happening, occurrence, natural event -- (an event that happens)
    » event -- (something that happens at a given place and time)

Start with something simpler?

• Term clustering?
  - simple monothetic clusters
  - No ordering.

Use subsumption

• Initially using subsumption.
  - Finds related terms
  - Decides which is more general, which is more specific (idf?)
• Strict interpretation
  - X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1
• In practice
  - X s Y iff P(x|y) > 0.8, P(y|x) < 1
  - P(x|y) > 0.8, P(y|x) < P(x|y)

[Figure: diagram labelled x, y and xy]
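
The relaxed test “P(x|y) > 0.8, P(y|x) < 1” can be sketched in Python by estimating each conditional probability from document sets: P(x|y) is taken as the share of documents containing y that also contain x. The function names, the `postings` structure and its contents are assumptions made for illustration; only the 0.8 threshold comes from the slide.

```python
def conditional(term_a, term_b, postings):
    """Estimate P(term_a | term_b) from sets of document ids."""
    docs_b = postings[term_b]
    if not docs_b:
        return 0.0
    return len(postings[term_a] & docs_b) / len(docs_b)


def subsumes(x, y, postings, threshold=0.8):
    """True if x subsumes y: P(x|y) > threshold and P(y|x) < 1."""
    return (conditional(x, y, postings) > threshold
            and conditional(y, x, postings) < 1.0)


# Hypothetical postings: term -> ids of the documents containing it.
postings = {
    "polio": {1, 2, 3, 4, 5, 6},
    "vaccine": {1, 2, 3},
    "syndrome": {4, 5, 9},
}

print(subsumes("polio", "vaccine", postings))   # every "vaccine" doc mentions "polio"
print(subsumes("vaccine", "polio", postings))   # only half of the "polio" docs do the reverse
print(subsumes("polio", "syndrome", postings))  # 2 of 3 "syndrome" docs, below the threshold
```

The asymmetry of the two conditions is what orders the pair: the term that occurs in (nearly) all of the other term's documents, but also elsewhere, becomes the parent.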

How to build a “hierarchy”

• X s Y
• X s Z
• X s M
• X s N
• Y s Z
• A s B
• A s Z
• B s Z

[Figure: graph of the relations above over the nodes X, Y, Z, M, N, A and B]

really it’s a DAG
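
A small Python sketch of why the result is a DAG, not a tree: collecting the eight “s” pairs listed on this slide into parent and child maps shows that Z ends up with several parents. The function name `build_dag` and the data layout are assumptions; the pairs themselves are copied from the slide.

```python
from collections import defaultdict

# (parent, child) pairs, one per "P s C" line on the slide.
pairs = [("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
         ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z")]


def build_dag(pairs):
    """Return (children, parents, roots) for a list of (parent, child) pairs."""
    children = defaultdict(list)
    parents = defaultdict(list)
    for parent, child in pairs:
        children[parent].append(child)
        parents[child].append(parent)
    nodes = {node for pair in pairs for node in pair}
    # Roots are the nodes that never appear as a child.
    roots = sorted(node for node in nodes if node not in parents)
    return dict(children), dict(parents), roots


children, parents, roots = build_dag(pairs)
print("roots:", roots)
print("parents of Z:", parents["Z"])
```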

How to display it?

• DAGs were big
  - Unlikely to get all on screen
  - Only want to see current focus plus route taken to get there?
• Use a method users are familiar with
  - Hierarchical menus

[Figure: hierarchical menu over X, Y, Z, M, N, A and B, in which Z appears more than once]
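
One way to show a DAG as hierarchical menus is to unfold it into a tree, repeating any node that has more than one parent under each of them. The Python sketch below does this with the relations from the previous slide. It is an illustration under assumptions (the name `menu_lines`, the indentation, and keeping every listed relation as a menu entry), so its layout need not match the figure on the slide. It also assumes the graph has no cycles.

```python
def menu_lines(children, roots, indent="  "):
    """Unfold a DAG into indented menu lines, depth first."""
    lines = []

    def visit(node, depth):
        lines.append(indent * depth + node)
        for child in children.get(node, []):
            visit(child, depth + 1)

    for root in roots:
        visit(root, 0)
    return lines


# Parent -> children, taken from the "X s Y" relations listed earlier.
children = {
    "X": ["Y", "Z", "M", "N"],
    "Y": ["Z"],
    "A": ["B", "Z"],
    "B": ["Z"],
}

lines = menu_lines(children, ["X", "A"])
print("\n".join(lines))
```

Each line of the output is one menu entry, and the indentation gives the route from a top-level term to the current focus.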

What about ambiguity?

• Monothetic clusters of ambiguous terms?
• Derive hierarchy from retrieved documents
  - Take a query and retrieve on it,
  - take top 500 documents,
  - build hierarchy from them.
• Topics/concepts are words/phrases taken from
  - Query
  - Retrieved documents
  - Comparison of frequencies
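
The slide does not say how the “comparison of frequencies” is carried out. One plausible reading, sketched below in Python purely as an assumption, is to keep a term when its document frequency in the retrieved set is a large enough fraction of its document frequency in the whole collection. The function name `select_terms`, the `min_ratio` value and all counts are hypothetical.

```python
def select_terms(retrieved_df, collection_df, min_ratio=0.1):
    """Keep terms whose retrieved-set document frequency is at least
    min_ratio of their collection-wide document frequency."""
    selected = []
    for term, df_retrieved in retrieved_df.items():
        df_collection = collection_df.get(term, 0)
        if df_collection and df_retrieved / df_collection >= min_ratio:
            selected.append(term)
    return sorted(selected)


# Hypothetical document frequencies: top 500 retrieved vs. whole collection.
retrieved_df = {"polio": 400, "vaccine": 120, "year": 450}
collection_df = {"polio": 600, "vaccine": 900, "year": 200000}

selected = select_terms(retrieved_df, collection_df)
print(selected)
```

A term such as “year” is frequent in the retrieved documents but just as common everywhere else, so a ratio test of this kind filters it out.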

Poliomyelitis and Post-Polio, TREC topic 302

Did you guess the paper?

• Bit like Peter Anick’s work?

Experiment

• Test properties of hierarchy
• Does it mimic (in some way) Yahoo-like categories?
  - Parent related to child?
  - Parent more general than child?

Experimental set-up

• Gathered eight subjects
  - Presented subsumption categories and ‘random’ categories.
  - Asked whether each parent/child pair is ‘interesting’.
    » If yes, then what type of relationship is it (types taken, roughly, from WordNet)?
    » Aspect of
    » Type of
    » Same as
    » Opposite of
    » Don’t know

Results

• Question of parent/child pairing ‘interesting’ or not
  - Random, 51%
  - Subsumption, 67%
  - Difference significant from t-test, p<0.002
• If interesting, what is parent/child type?

Odd?

Yahoo categories?

Results and conclusions

• Interesting AND (aspect of OR type of)
  - Random, 28% (51% * (47% + 8%))
  - Subsumption, 48% (67% * (49% + 23%))
• Appears that subsumption and an ordering based on document frequency do a reasonable job.
  - For term frequency work, see:
    » Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21
    » Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
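
The combined figures on this slide (28% and 48%) can be checked with a few lines of Python. Each is the share of pairs judged ‘interesting’ multiplied by the share of those pairs labelled “aspect of” or “type of”. The slide does not say which of the two bracketed percentages belongs to which label, so the parameter names below are neutral, and the function name `combined` is an assumption.

```python
def combined(interesting, first_share, second_share):
    """Share of all pairs that are interesting AND carry one of two labels."""
    return interesting * (first_share + second_share)


random_rate = combined(0.51, 0.47, 0.08)       # 0.51 * 0.55
subsumption_rate = combined(0.67, 0.49, 0.23)  # 0.67 * 0.72

print(f"random: {random_rate:.0%}, subsumption: {subsumption_rate:.0%}")
```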

Future work?

• More user studies.
• Incorporate other term relationship techniques
• Other visualisations
• Application of techniques to whole document collections.
• Presentation of Cross Language IR results?