Twitter User Geolocation Using a Unified Text and Network Prediction Model
Afshin Rahimi, Trevor Cohn and Timothy Baldwin
Department of Computing and Information Systems, The University of Melbourne
OVERVIEW
Task: Where does @ShvwnK live?
Input: user, concatenated tweet text, mention-list
Output: latitude/longitude (known for training users, predicted for test users)
Datasets: three Twitter geolocation datasets (number of users in parentheses): GeoText (9.5K), Twitter-US (450K) and Twitter-World (1.4M).
TEXT-BASED MODEL
Logistic regression with L1 regularisation over a k-d tree discretisation of latitude/longitude.
[Figures: top features of NYC; use of “upstate” in the U.S.]
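The k-d tree discretisation can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `bucket_size` parameter and the toy coordinates are all assumptions. Each resulting cell becomes one class for the logistic regression classifier, and a predicted class can be mapped back to a coordinate via the cell centroid.

```python
def kd_partition(points, bucket_size, depth=0):
    """Split (lat, lon) points at the median, alternating axes,
    until every cell holds at most bucket_size points (bucket_size >= 1)."""
    if len(points) <= bucket_size:
        return [points]
    axis = depth % 2  # 0 = latitude, 1 = longitude
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], bucket_size, depth + 1)
            + kd_partition(ordered[mid:], bucket_size, depth + 1))


def centroid(cell):
    """Mean latitude/longitude of a cell, used as the predicted location."""
    lat = sum(p[0] for p in cell) / len(cell)
    lon = sum(p[1] for p in cell) / len(cell)
    return (lat, lon)


# Toy training coordinates (hypothetical): two on the US east coast, two on the west coast.
train_coords = [(40.7, -74.0), (40.8, -73.9), (37.8, -122.4), (34.1, -118.2)]

cells = kd_partition(train_coords, bucket_size=2)
# Class label of each training point = index of the cell that contains it.
class_of = {p: i for i, cell in enumerate(cells) for p in cell}
centroids = [centroid(cell) for cell in cells]
```

Because cells are split at the median, dense regions end up with small cells and sparse regions with large ones, so every class has a similar number of training users.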
NETWORK-BASED MODEL
Label propagation in a collapsed network:
• Build the graph using @-mentions.
• Use training nodes as seed (labelled samples).
• Infer the test labels by Modified Adsorption (Talukdar and Crammer, 2009).
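\[
\hat{Y} = \arg\min_{\hat{Y}} \; c(\hat{Y}), \qquad
c(\hat{Y}) = \sum_{l} \Big[ \mu_1 \overbrace{(Y_l - \hat{Y}_l)^{T} S \, (Y_l - \hat{Y}_l)}^{\text{match seed}} + \mu_2 \underbrace{\hat{Y}_l^{T} L \, \hat{Y}_l}_{\text{smooth labels}} \Big]
\]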
[Figure: a new label estimate for a node, computed from its neighbours over edges weighted 0.7, 0.5 and 0.01.]
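To make the two terms of the objective concrete (a seed-matching term weighted by µ1 and a smoothness term weighted by µ2), here is a minimal sketch that minimises them with Jacobi iterations. It is not the full Modified Adsorption algorithm of Talukdar and Crammer (2009); the function name `propagate`, the default weights and the toy graph are assumptions made for illustration.

```python
def propagate(edges, seeds, n_classes, mu1=1.0, mu2=1.0, iters=200):
    """Jacobi iterations for: mu1 * (match seed labels) + mu2 * (smooth over edges).

    edges: dict node -> dict neighbour -> weight (symmetric; every neighbour is a key)
    seeds: dict node -> list of class probabilities (the labelled nodes)
    """
    uniform = [1.0 / n_classes] * n_classes
    est = {v: list(seeds.get(v, uniform)) for v in edges}
    for _ in range(iters):
        new = {}
        for v, nbrs in edges.items():
            seed_w = mu1 if v in seeds else 0.0  # non-seed nodes have no seed term
            seed_label = seeds.get(v, uniform)
            denom = seed_w + mu2 * sum(nbrs.values())
            if denom == 0.0:  # isolated, unlabelled node: nothing to update
                new[v] = est[v]
                continue
            new[v] = [
                (seed_w * seed_label[k]
                 + mu2 * sum(w * est[u][k] for u, w in nbrs.items())) / denom
                for k in range(n_classes)
            ]
        est = new
    return est


# Toy graph (hypothetical): x is strongly tied to seed a and weakly tied to seed b.
edges = {"a": {"x": 1.0}, "x": {"a": 1.0, "b": 0.2}, "b": {"x": 0.2}}
seeds = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

labels = propagate(edges, seeds, n_classes=2)
predicted = {v: max(range(2), key=lambda k: labels[v][k]) for v in labels}
```

In this toy graph the unlabelled node x inherits the class of a, the seed it is most strongly connected to.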
FROM @-MENTION TO COLLAPSED NETWORK
[Figure: the @-mention network (left) and the collapsed network with text dongle nodes (right). Legend: labelled nodes, unlabelled nodes, mentioned nodes, text dongle nodes. One mentioned node is marked as a celebrity.]
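One way to read the collapsing step is sketched below. It assumes that mentioned nodes outside the dataset are removed and that the users who mentioned them are linked directly instead; this reading, the function name `collapse` and the toy handles are assumptions, not details stated on the poster.

```python
from collections import defaultdict
from itertools import combinations


def collapse(mentions, known_users):
    """Build an undirected graph over known_users from @-mentions.

    mentions: dict user -> set of mentioned handles
    Direct mentions between known users become edges. Mentioned handles that
    are not known users are dropped, and their mentioners are linked pairwise.
    """
    adj = {u: set() for u in known_users}
    mentioned_by = defaultdict(set)
    for user, targets in mentions.items():
        if user not in adj:
            continue
        for target in targets:
            if target == user:
                continue
            if target in adj:
                adj[user].add(target)
                adj[target].add(user)
            else:
                mentioned_by[target].add(user)
    # Collapse each external mentioned node into edges between its mentioners.
    for mentioners in mentioned_by.values():
        for u, v in combinations(sorted(mentioners), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


# Toy data (hypothetical handles): "newsbot" is mentioned but is not in the dataset.
known = {"alice", "bob", "carol"}
mentions = {"alice": {"bob", "newsbot"}, "carol": {"newsbot"}, "bob": set()}

graph = collapse(mentions, known)
```

The pairwise linking is quadratic in the number of mentioners of a node, which is why very highly mentioned nodes make the collapsed graph dense.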
UNIFIED MODEL: NETWORK & TEXT
• For connected users, network-based models are more accurate.
• For disconnected users (about 20% of the nodes), text-based models are more accurate.
• Solution: Utilise both text and network!
• For each test node, attach a text dongle node carrying text-based predictions.
• Add the text dongle nodes to seed nodes (like training nodes).
• Use Modified Adsorption to infer the labels.
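The dongle construction in the steps above can be sketched as follows. The function name `add_text_dongles`, the `weight` parameter, the `("text", user)` node naming and the toy data are assumptions for illustration; the returned graph and seed set would then be passed to the label propagation step.

```python
def add_text_dongles(edges, seeds, text_predictions, weight=1.0):
    """Attach one seed "dongle" node per test user, labelled with the
    text-based class distribution for that user.

    edges: dict node -> dict neighbour -> weight
    seeds: dict node -> list of class probabilities
    text_predictions: dict test user -> list of class probabilities
    Returns new (edges, seeds); the inputs are left unchanged.
    """
    edges = {v: dict(nbrs) for v, nbrs in edges.items()}
    seeds = dict(seeds)
    for user, dist in text_predictions.items():
        dongle = ("text", user)  # hypothetical naming scheme for dongle nodes
        edges.setdefault(user, {})[dongle] = weight
        edges[dongle] = {user: weight}
        seeds[dongle] = list(dist)
    return edges, seeds


# Toy data (hypothetical): test2 has no @-mention edges at all.
edges = {"train1": {"test1": 1.0}, "test1": {"train1": 1.0}, "test2": {}}
seeds = {"train1": [1.0, 0.0]}
text_predictions = {"test1": [0.4, 0.6], "test2": [0.1, 0.9]}

new_edges, new_seeds = add_text_dongles(edges, seeds, text_predictions)
```

After this step even a disconnected user such as test2 has one labelled neighbour, so propagation falls back to the text-based prediction for that user.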
“CELEBRITIES” DON’T GEOLOCATE
• “Celebrities” (highly mentioned users) are connected from everywhere.
• They connect lots of people.
• Solution: Remove users with more than T mentions.
• Results in sparser graphs (tractable inference) and more accurate geolocation.
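A minimal sketch of the celebrity filter follows. It assumes that "mentions" are counted as the number of distinct users mentioning a handle and that celebrities are removed as mention targets; the function name `remove_celebrities` and the toy handles are illustrative assumptions.

```python
from collections import Counter


def remove_celebrities(mentions, threshold):
    """Drop handles mentioned by more than `threshold` distinct users.

    mentions: dict user -> set of mentioned handles
    Returns (filtered mentions, set of removed celebrity handles).
    """
    counts = Counter(t for targets in mentions.values() for t in set(targets))
    celebrities = {t for t, c in counts.items() if c > threshold}
    filtered = {
        user: {t for t in targets if t not in celebrities}
        for user, targets in mentions.items()
    }
    return filtered, celebrities


# Toy data (hypothetical handles): "star" is mentioned by all three users.
mentions = {"u1": {"star", "u2"}, "u2": {"star"}, "u3": {"star", "u1"}}

filtered, celebrities = remove_celebrities(mentions, threshold=2)
```

Applying the filter before collapsing the network avoids creating an edge between every pair of users who mention the same celebrity.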
TUNING T (TWITTER-US)
[Figure: mean error (in km, left axis, 700 to 860) and graph size (# edges, right axis, 10^5 to 10^9) on Twitter-US as a function of the celebrity threshold T (# of mentions), for T = 2, 5, 15, 50, 500 and 5k.]
Decreasing T results in: sparser graph, lower mean error.
RESULTS
State-of-the-art results over all three datasets!
[Figure: mean error (km, axis from 600 to 1600) on GeoText, Twitter-US and Twitter-World, ordered from smaller to larger dataset. Systems compared:
• Network-based model (this work)
• Unified model (this work)
• Network-based: Rahimi et al. (NAACL 2015)
• Text-based: Rahimi et al. (NAACL 2015)
• Text-based: Wing and Baldridge (EMNLP 2014)
• Text-based: Cha et al. (ICWSM 2015)]