Pinpointing Locational Focus in Microblogs Jie Yin, Sarvnaz Karimi, John Lingad November 2014 DIGITAL PRODUCTIVITY FLAGSHIP




Where is it happening?

For those monitoring social media to
• send help in an emergency
• avoid certain areas (e.g., for traffic)
• recommend services (ads)

CSIRO: positive impact | Pinpointing Locational Focus in Tweets | Sarvnaz Karimi

Find it on the map!

Locational focus


Locational focus: Macquarie Centre, North Ryde, New South Wales, Australia

Location mentions: Sydney, Macquarie Centre
Author Location: Brisbane, Australia

Some tweets mention multiple locations: Not easy to identify the focus


Mary River, Queensland, Australia

Some tweets have no locational focus


There is an unknown location (Ambiguity)

No specific focus (World Level?)

To find locational focus, we have two tasks:
1. Find mentions of locations
2. Aggregate these to infer the main focus


Finding location mentions

1. Where to look for location mentions
• Location mentions can be in the tweet text and/or in hashtags
• Some hashtags are concatenated words or abbreviations, e.g., #QLDflood = QLD + flood
• Tweet text may mention a geographical location, such as Sydney, or a Point-of-Interest (POI), such as an organisation's name or a shop
• Authors' locations in their profiles (not exactly a location mention)

2. How to find these mentions?
• Hashtag segmentation
• Named Entity Recognition


Location mention extraction

• Related work
  • NER tools for formal text, such as Stanford NER and OpenNLP, are highly accurate (solved problem).
  • NER specific to Twitter: TwiNER [Wang et al., 2012], TwitterNLP [Ritter et al., 2011]
  • Retrained NER tools for Twitter [Lingad et al., 2013] – Location and Organisation entities only

• In this work:
  • Segmented the hashtags using a simple greedy maximal matching heuristic, with an English dictionary augmented with place name abbreviations
  • Used retrained Stanford NER, keeping LOC and ORG entities
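A minimal sketch of greedy maximal matching for hashtag segmentation is given here for illustration. The function name, the toy vocabulary, and the single-character fallback for unknown text are assumptions; the slides do not describe the dictionary contents or how unmatched characters were handled.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into known words by greedy longest-match-first scanning."""
    text = hashtag.lstrip("#").lower()
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink until a known word is found.
        for end in range(len(text), start, -1):
            if text[start:end] in vocabulary:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # Assumed fallback: no known word starts here, so emit one character.
            tokens.append(text[start])
            start += 1
    return tokens


# Hypothetical toy vocabulary: English words plus place-name abbreviations.
VOCAB = {"qld", "nsw", "vic", "flood", "fire", "bush", "bushfire", "sydney"}

print(segment_hashtag("#QLDflood", VOCAB))     # ['qld', 'flood']
print(segment_hashtag("#NSWbushfire", VOCAB))  # ['nsw', 'bushfire']
```

Because the longest match is tried first, "bushfire" wins over "bush" + "fire" whenever the longer word is in the dictionary.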


Inferring locational focus

Given a list of location mentions, determine what the focus is.

For example:

If mentions are VIC, NSW, QLD, WA then focus is Australia
If mentions are Swanston St, RMIT then focus is RMIT University, Melbourne, VIC, Australia

• Requires knowledge of the geographical locations as well as POIs and their relationships/hierarchy.
• Gazetteer Australia 2010, GeoNames New Zealand, OpenStreetMap


Gazetteer as a tree


Specific POI

City/Suburb/Town/Non-Specific POI (e.g., river, highway)

State/Territory/Region

Country

Inference algorithm: Where on the map?

• Step 1: Query the gazetteer with the location mentions, and return matching results (partial or exact match) as full paths in the gazetteer tree
• Step 2: Create an inference tree using the returned paths
• Step 3: Propagate the scores in the tree
• Step 4: Find a maximum scoring path


Goal: Find the lowest granularity possible (the most specific location)
Assumption: More possible matches found within a geographical region indicate that the region is more likely to be the focus

Querying the gazetteer tree

• Location mentions: Sydney, Macquarie Centre
• Author Location: Brisbane, Australia

• Gazetteer querying returns:
  - brisbane, queensland, australia
  - south brisbane, queensland, australia
  - macquarie centre, north ryde, new south wales, australia
  - macquarie university, macquarie park, new south wales, australia
  - ...

Each returned result gets a matching score based on the Jaccard similarity between the query and the matched node.
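As a rough illustration of the matching score, the sketch below computes Jaccard similarity over lowercased word sets. Word-level tokenisation is an assumption; the slides do not say whether words, characters, or n-grams were compared.

```python
def jaccard(query, candidate):
    """Jaccard similarity between the word sets of two place-name strings."""
    a = set(query.lower().split())
    b = set(candidate.lower().split())
    if not a and not b:
        return 0.0
    # Size of the intersection divided by size of the union.
    return len(a & b) / len(a | b)


print(jaccard("Brisbane", "brisbane"))        # 1.0 (exact match)
print(jaccard("Brisbane", "south brisbane"))  # 0.5 (partial match)
```

Under this assumption an exact match scores 1.0, and partial matches such as "south brisbane" score lower the more extra words they carry.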


Building the inference tree


Returned paths:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia

Inference tree:
earth
  australia
    queensland
      brisbane (Leaf Score)
    new south wales
      north ryde
        macquarie centre (Leaf Score)
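The tree construction can be sketched as merging the returned paths under a shared root. The nested-dictionary node layout, the function name, and the leaf scores in the usage lines are illustrative assumptions, not the authors' data structure.

```python
def build_inference_tree(scored_paths):
    """Merge gazetteer paths (most specific name first) into one tree rooted at 'earth'."""
    root = {"name": "earth", "score": 0.0, "children": {}}
    for path, leaf_score in scored_paths:
        node = root
        # Paths are written leaf-first, so walk them in reverse (country first).
        for name in reversed([part.strip() for part in path.split(",")]):
            node = node["children"].setdefault(
                name, {"name": name, "score": 0.0, "children": {}}
            )
        # Keep the best score if the same place is returned more than once.
        node["score"] = max(node["score"], leaf_score)
    return root


# Hypothetical leaf scores, for illustration only.
tree = build_inference_tree([
    ("brisbane, queensland, australia", 1.0),
    ("macquarie centre, north ryde, new south wales, australia", 2.0),
])
australia = tree["children"]["australia"]
print(sorted(australia["children"]))  # ['new south wales', 'queensland']
```

Paths that share a prefix (here, australia) share nodes, which is what lets several matches in one region reinforce the same branch.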

Propagating scores to the parents and finding the maximal path

• More branches within a sub-tree increase the chance of their parent being in the maximal path
• Bottom-up scoring of parents from the leaves to the root
  • Parent score = current score + 0.5 * score of the highest scoring child
• Top-down selection of the maximal path, with entropy as the termination condition: if the entropy of the children's scores is higher than a pre-defined threshold, the algorithm stops at that level.
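The two passes can be sketched as follows. The parent update follows the rule on the slide; the entropy normalisation, the threshold value, the node layout, and the example scores are assumptions made for illustration.

```python
import math


def propagate(node):
    """Bottom-up pass: parent score = own score + 0.5 * best child score."""
    child_scores = [propagate(child) for child in node["children"].values()]
    if child_scores:
        node["score"] += 0.5 * max(child_scores)
    return node["score"]


def normalised_entropy(scores):
    """Shannon entropy of the scores as a distribution, scaled to [0, 1] (assumed)."""
    total = sum(scores)
    if len(scores) < 2 or total <= 0:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(scores))


def select_path(root, threshold):
    """Top-down pass: follow the best child until the children are too evenly scored."""
    path, node = [], root
    while node["children"]:
        children = list(node["children"].values())
        if normalised_entropy([c["score"] for c in children]) > threshold:
            break  # no child clearly dominates, so stop at this level
        node = max(children, key=lambda c: c["score"])
        path.append(node["name"])
    return path


def leaf(name, score):
    return {"name": name, "score": score, "children": {}}


def branch(name, *children):
    return {"name": name, "score": 0.0, "children": {c["name"]: c for c in children}}


# Hypothetical leaf scores: sydney and north ryde end up tied after propagation.
tree = branch("earth", branch("australia",
    branch("queensland", leaf("brisbane", 1.0)),
    branch("new south wales",
           leaf("sydney", 4.0),
           branch("north ryde", leaf("macquarie centre", 8.0)))))
propagate(tree)
print(select_path(tree, threshold=0.9))  # ['australia', 'new south wales']
```

In this example new south wales clearly outscores queensland, so the path descends to it; its two children are tied, so the entropy exceeds the threshold and the focus is reported at state level.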


[Figure: inference tree with propagated scores]

Returned paths:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia
- sydney, new south wales, australia

Nodes shown: earth, australia, queensland, brisbane, new south wales, sydney, north ryde, macquarie centre, macquarie university

Score labels shown: A, B, C, D, 0.5*A, 0.5*Max(B,C), 0.5*Max(0.5*B,D)

Leaf score = w * 2^level * Jaccard similarity
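The leaf score formula can be written out directly. The interpretation of its inputs is an assumption: w is taken to be the weight of the information source a mention came from (the results slide uses 0.6 for text, 0.3 for hashtag, 0.1 for user location), and level is taken to be the depth in the gazetteer tree (1 = country up to 4 = specific POI).

```python
def leaf_score(weight, level, similarity):
    """Leaf score = w * 2^level * Jaccard similarity."""
    return weight * (2 ** level) * similarity


# Assumed inputs: source weight, tree depth (1 = country ... 4 = POI), Jaccard score.
print(leaf_score(0.6, 4, 1.0))  # exact POI match from tweet text -> 9.6
print(leaf_score(0.1, 2, 0.5))  # partial state match from author location -> 0.2
```

Under these assumptions the 2^level factor doubles the score for each step down the tree, which favours more specific locations.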

Dataset & annotation

• Queried Twitter with keywords such as fire, earthquake, storm, hurricane
• Randomly sampled 7,000 tweets
• Two annotation steps:
  1. Identify location mentions
  2. Identify locational focus (based on tweet and author location)
• Three annotators per tweet; only tweets with majority agreement (2 out of 3) were kept in the final set.
• Tweets whose focus was not within Australia or New Zealand were removed.
• A small set of tweets marked as impossible to detect the focus were removed.
• Final set: 1398 tweets (80 kept for parameter tuning)
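The majority-agreement filter can be sketched as below. The function name and the example labels are hypothetical, and the sketch assumes each tweet has at least one annotation.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the label chosen by at least min_votes annotators, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical annotations from three annotators for two tweets.
print(majority_label(["Brisbane, QLD", "Brisbane, QLD", "Australia"]))  # Brisbane, QLD
print(majority_label(["Sydney, NSW", "NSW", "Australia"]))              # None
```

Tweets for which the function returns None would be dropped from the final set.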


Baseline: Yahoo! PlaceFinder*

• A service that accepted queries and returned a list of matching places in the form of country, state, city, POI
• A query to the service was similar to a database query: SELECT * FROM geo.placefinder WHERE text = query text

We chose the query text to be
(a) the tweet (text & hashtag) and the user location from their profile
(b) the list of location mentions from one tweet (human annotated)


* As it was called in Jan 2014

Accuracy with manual location mentions (without NER)

Level                    All    Text   Hashtag   User Location
Level 1 - Country        89.9   35.3   45.2      71.6
Level 2 - State          73.5   29.3   37.4      36.3
Level 3 - City/Suburb    51.0   24.5   12.4       4.9
Level 4 - POI            29.7   11.7    8.1       1.8
No focus                 58.5   95.8   96.4      63.2


User location was most useful at the country level, but did not contribute much at other levels of granularity.
POI was hardest: only ~30% were correctly identified.

All = 0.6 text + 0.3 hashtag + 0.1 user location

Accuracy with location mentions extracted using NER

Setting              Level 1   Level 2   Level 3   Level 4   No Focus
PlaceFinder (a)      87.9      58.6      22.9      21.0       0.3
PlaceFinder (b)      87.8      59.1      23.5      18.8      25.5
Our Alg. No NER      89.9      73.5      51.0      29.7      58.5
Our Alg. With NER    91.3      65.7      47.0      24.9      53.4


(a) The whole tweet was queried; (b) the location mentions were queried (no NER)

Country-level focus was the easiest, with all settings performing similarly.
PlaceFinder was consistently worse at the other levels, but that could also be an effect of our gazetteer hierarchy.

The sources of errors in our algorithm

• Annotation mistakes: human annotators missed some of the mentions.
• Some streets and POIs were missing from the gazetteer.
• Heavily misspelled place names were not corrected in our pre-processing step.
• We favoured lower granularities in our scoring, which introduced wrong POIs that were not needed.
• Gazetteer bias: if one mention had many matches in a region, the path could wrongly get stronger.


What we learnt and what’s next?

• Finding locational focus is difficult, even for humans (low agreement in annotation)
• Our method was accurate (90%) at country level, but accuracy dropped for state, city, and POI levels (29%).
• All three information sources (text, hashtag, and author location) contribute to finding the focus, but at different levels.

• How to make it better?
  • Incorporate some context, e.g., tweets that share hashtags, are replies, or are temporally close
  • Learn the weights of different information sources


Related Studies

• TwitterStand: geotagging content of tweets. Used GeoNames gazetteer and heuristic rules to find and disambiguate the location focus.
• Kinsella et al. [2011]: learning language models of locations
