lecture: semantic word clouds
TRANSCRIPT
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2014/sais_2014.htm
Semantic Word Clouds
Marina Santini [email protected]
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2014
Lect 10: Semantic Word Clouds
Acknowledgements
• Some slides borrowed from Sergey Pupyrev.
Outline
• Word Clouds
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation
Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks
Comparison & Conceptualization Tool
• Word clouds as a tool for “conceptualizing” documents. Cf. ontologies
• Ex: 2008, comparison of speeches: Obama vs McCain
Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g., band popularity) or serve as a navigation tool (e.g., Google search results).
The Problem…
• How can we compute semantics-preserving word clouds, in which semantically related words are close to each other?
Wordle http://www.wordle.net
• Practical tools, like Wordle, make word cloud visualization easy.
• Shortcoming: they do not capture the relationships between words in any way
Many word clouds are arranged randomly (look also at the scattered colours)
Semantic Patterns
• Humans instinctively tend to pick up patterns
• Instinctively, one could say that two words that are close to each other in a word cloud are semantically related.
So, it makes sense to place such related words close to each other (look also at the color distribution)
In linguistics and in LT…
• … if a pair of words often appears together in a sentence, then we can assume that this pair of words is semantically related.
Semantic word clouds have higher user satisfaction compared to other layouts…
All recent word cloud visualization tools aim to incorporate semantics in the layout…
… but none of them provides any guarantee about the quality of the layout in terms of semantics
Early algorithms: Force-Directed Graph
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way
– Attractive forces between pairs of words reduce empty space
– Repulsive forces ensure that words do not overlap
– A final force preserves semantic relations between words.
Force-directed graph drawing algorithms assign forces among the set of edges and the set of nodes of a graph drawing. Typically, spring-like attractive forces based on Hooke's law are used to attract pairs of endpoints of the graph's edges towards each other, while simultaneously repulsive forces like those of electrically charged particles based on Coulomb's law are used to separate all pairs of nodes.
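The spring and charge model described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the layout code of any particular tool: the function name `force_directed_layout` and the constants `k_spring`, `k_repulse`, `rest_length` and `max_step` are assumptions chosen for the example.

```python
import math
import random

def force_directed_layout(n, edges, iterations=200, k_spring=0.05,
                          k_repulse=0.5, rest_length=1.0, max_step=0.1, seed=0):
    """Return (x, y) positions for nodes 0..n-1 (all constants are assumptions)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        force = [[0.0, 0.0] for _ in range(n)]
        # Coulomb-like repulsion between every pair of nodes.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k_repulse / (dist * dist)
                force[i][0] += f * dx / dist
                force[i][1] += f * dy / dist
                force[j][0] -= f * dx / dist
                force[j][1] -= f * dy / dist
        # Hooke-like spring attraction along every edge.
        for i, j in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = k_spring * (dist - rest_length)
            force[i][0] += f * dx / dist
            force[i][1] += f * dy / dist
            force[j][0] -= f * dx / dist
            force[j][1] -= f * dy / dist
        # Move each node along its net force, with a capped step length.
        for i in range(n):
            fx, fy = force[i]
            mag = math.hypot(fx, fy)
            if mag > 0:
                step = min(mag, max_step)
                pos[i][0] += fx / mag * step
                pos[i][1] += fy / mag * step
    return [tuple(p) for p in pos]
```

With `force_directed_layout(3, [(0, 1)])`, nodes 0 and 1 are held together by the spring, while node 2, which has no edge, is only repelled and drifts away.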
Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
– The vertices of the graph are the words
• Their weight corresponds to some measure of importance (e.g., word frequency)
– The edges capture the semantic relatedness of pairs of words (e.g., co-occurrence)
• Their weight corresponds to the strength of the relation
– Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight
– The realized adjacencies are the sum of the edge weights over all pairs of touching boxes.
– The goal is to maximize the realized adjacencies.
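As a small worked example of the objective above, the following sketch sums the edge weights over all pairs of touching boxes. The box format `(x, y, width, height)`, the tolerance `eps`, and the rule that two boxes "touch" when they share a boundary segment of positive length are assumptions made for illustration.

```python
def touching(a, b, eps=1e-9):
    """True if boxes a and b (x, y, width, height) share a side segment."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Signed overlap of the projections of the two boxes on each axis.
    x_overlap = min(ax + aw, bx + bw) - max(ax, bx)
    y_overlap = min(ay + ah, by + bh) - max(ay, by)
    # Side contact: the boxes abut on one axis and overlap on the other.
    return ((abs(x_overlap) <= eps and y_overlap > eps) or
            (abs(y_overlap) <= eps and x_overlap > eps))

def realized_adjacencies(boxes, edge_weights):
    """Sum the weights of all edges whose two words are drawn as touching boxes."""
    return sum(w for (u, v), w in edge_weights.items()
               if touching(boxes[u], boxes[v]))

boxes = {"word": (0, 0, 2, 1), "cloud": (2, 0, 2, 1), "layout": (5, 0, 1, 1)}
weights = {("word", "cloud"): 0.8, ("word", "layout"): 0.3, ("cloud", "layout"): 0.5}
print(realized_adjacencies(boxes, weights))  # only "word" and "cloud" touch: 0.8
```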
Experimental Setup: 1) Term Extraction 2) Ranking 3) Similarity Computation
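The slide does not say which ranking or similarity functions were used, so the following is only one plausible reading of the three steps: terms are extracted per sentence, ranked by raw frequency, and compared with cosine similarity over their sentence-occurrence vectors (the co-occurrence idea mentioned earlier in the lecture). All function names and the tiny stopword list are assumptions.

```python
import math
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "is", "are", "to"})

def extract_terms(text):
    """Step 1: split into sentences and keep the non-stopword tokens of each."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    return [[w for w in re.findall(r"[a-z]+", s) if w not in STOPWORDS]
            for s in sentences]

def rank_terms(sentences, top_k):
    """Step 2: rank terms by frequency (an assumed ranking function)."""
    return Counter(w for s in sentences for w in s).most_common(top_k)

def cosine_similarity(sentences, a, b):
    """Step 3: cosine similarity of the per-sentence count vectors of a and b."""
    va = [s.count(a) for s in sentences]
    vb = [s.count(b) for s in sentences]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0
```

Words that always occur in the same sentences get a similarity close to 1, and words that never share a sentence get 0.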
Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving
Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, aiming to use space as efficiently as possible.
• First, the words are sorted by weight in decreasing order.
• Then, for each word in that order, a position is picked at random.
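The steps above can be sketched as follows. This is a simplified illustration, not Wordle's actual code: retrying random positions until the box does not overlap an already placed word (up to `max_tries`) is an assumption, as are the function names and the `(weight, width, height)` input format.

```python
import random

def overlaps(a, b):
    """True if the interiors of boxes a and b (x, y, width, height) intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def random_layout(words, width, height, max_tries=1000, seed=0):
    """words maps each word to (weight, box_width, box_height)."""
    rng = random.Random(seed)
    placed = {}
    # Heaviest words first, as in the greedy order described above.
    for word, (weight, w, h) in sorted(words.items(), key=lambda kv: -kv[1][0]):
        for _ in range(max_tries):
            box = (rng.uniform(0, width - w), rng.uniform(0, height - h), w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
    return placed
```

Because positions ignore word similarity entirely, related words end up close together only by chance.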
Steps 1–6: Random
Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed
• Second, an effort is made to create a compact layout
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
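To make the idea concrete, here is a minimal sketch of classical (Torgerson) MDS using NumPy. Whether CPWCV uses this exact variant is not stated in the slides, so treat the choice of classical MDS and the function name `classical_mds` as assumptions.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed n items in `dims` dimensions from an n x n dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale
```

Each row of the result is a 2D position for one word; words with a small dissimilarity land near each other, which gives the initial placement that the later steps then compact.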
Steps 1–3: Context-Preserving (step 2: repulsive force; step 3: attractive force)
Seam Carving
• Seam carving is a content-aware image resizing technique
• Basically, an algorithm for image resizing
• It was invented at Mitsubishi’s
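For readers unfamiliar with the image-resizing technique, the core step can be sketched with dynamic programming: find the top-to-bottom path ("seam") of minimum total energy and delete it. This sketch shows the generic image version on a small grid; in the word cloud setting a cell with energy 0 would stand for empty space, which is an interpretation and not something the slides spell out. The function names are hypothetical.

```python
def min_vertical_seam(energy):
    """Return, for each row from top to bottom, the column of the cheapest seam."""
    rows, cols = len(energy), len(energy[0])
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            # A seam may continue from the cell above or its two neighbours.
            lo, hi = max(0, c - 1), min(cols, c + 2)
            row.append(energy[r][c] + min(prev[lo:hi]))
        cost.append(row)
    # Backtrack from the cheapest cell in the bottom row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(0, c - 1), min(cols, c + 2)
        c = min(range(lo, hi), key=lambda j: cost[r - 1][j])
        seam.append(c)
    return seam[::-1]

def remove_seam(grid, seam):
    """Delete one cell per row, shrinking the grid width by one."""
    return [row[:c] + row[c + 1:] for row, c in zip(grid, seam)]
```

Repeating the two calls trims low-energy paths one at a time, which is the iterative trimming shown in the step slides.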
Steps 1–7: Seam Carving (steps 2 and 6: space is divided into regions; step 3: empty paths are trimmed out iteratively)
3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover
Inflate-and-Push
• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.
Steps 1–5: Inflate (step 2: scaling down; step 3: semantically related words are placed close to each other; step 4: repulsive force to resolve overlaps)
Star Forest
• A star is a tree of depth at most 1 (one center vertex connected to all the others), and a star forest is a forest whose connected components are all stars.
Star Forest: star = graph
• Dissimilarity matrix → disjoint stars = star forest
• Attractive force to get a compact layout
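The definition of a star forest can be checked with a short test: in a simple graph, every component is a star exactly when every edge has at least one endpoint of degree 1. The sketch below only verifies that property for a given edge list; it does not show how the algorithm extracts the stars from the matrix, and the function name `is_star_forest` is hypothetical.

```python
from collections import Counter

def is_star_forest(edges):
    """True if every connected component of the graph is a star."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each edge must join a leaf (degree 1) to its star center.
    return all(degree[u] == 1 or degree[v] == 1 for u, v in edges)

print(is_star_forest([("cloud", "word"), ("cloud", "tag"), ("graph", "edge")]))  # True
print(is_star_forest([("a", "b"), ("b", "c"), ("c", "d")]))  # False: a path of 3 edges
```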
Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path (= cycle) is created
• Then, the optimal level of compactness is computed
Quan(ta(ve Metrics
Criteria
1. Realized Adjacencies – how close are similar words to each other?
2. Distortion – how distant are dissimilar words?
3. Compactness – how well utilized is the drawing area?
4. Uniform Area Utilization – uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
5. Aspect Ratio – width and height of the bounding box
6. Running Time – execution time
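Two of these criteria are simple enough to illustrate directly. The slides give no formulas, so the definitions below are assumptions: compactness is taken as the fraction of the bounding box covered by word boxes (assuming the boxes do not overlap), and aspect ratio as bounding-box width divided by height. Boxes are `(x, y, width, height)` tuples.

```python
def bounding_box(boxes):
    """Smallest axis-aligned rectangle (x, y, width, height) containing all boxes."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    return x0, y0, x1 - x0, y1 - y0

def compactness(boxes):
    """Share of the bounding box used by words (assumed definition)."""
    _, _, bw, bh = bounding_box(boxes)
    return sum(w * h for _, _, w, h in boxes) / (bw * bh)

def aspect_ratio(boxes):
    """Width of the bounding box divided by its height (assumed definition)."""
    _, _, bw, bh = bounding_box(boxes)
    return bw / bh

layout = [(0, 0, 2, 1), (2, 0, 2, 1), (0, 1, 1, 1)]
print(compactness(layout), aspect_ratio(layout))  # 0.625 2.0
```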
2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.
Cycle Cover wins
Seam Carving wins
Random wins
Inflate wins
Random and Seam Carving win
All ok except Seam Carving
Demo
Final Words
The end